
26 March, 2017

A programmer's sightseeing tour: Machine Learning and Deep Neural Networks (part 1)

TL;DR: You probably don't need DNNs, but you ABSOLUTELY SHOULD know and practice data analysis!

This won't be short...

- Machine Learning

Machine learning is a huge field nowadays, with lots of techniques and sub-disciplines. It would be very hard for me to provide an overview in a single article, and I certainly don't claim to know all about it.
The goal of this article is to introduce you to the basic concepts, just enough so we can orient ourselves and understand what we might need in our daily job as programmers.

I'll try to do so using terminology that is as close as possible to what a programmer might expect, instead of the grammar of machine learning, which annoyingly often calls the same things by different names depending on the specific subdomain.
This is particularly a shame because, as we'll soon see, lots of different fields, including disciplines that are not usually considered "machine learning" at all, are really intertwined and closely related.

- Supervised and unsupervised learning

The first thing we have to know is that there are two main kinds of machine learning: supervised and unsupervised learning. 
Both deal with data, or if you wish, functions that we don't have direct access to but that we know through a number of samples of their outputs.

In the case of supervised learning, our data comes in the form of input->output pairs; each point is a vector of the unknown function inputs and it's labeled with the return value.
Our job is to learn a functional form that approximates the data; in other words, through data, we are learning a function that approximates a second unknown one. 
Clearly supervised learning is closely related to function approximation.

Another name for this is regression analysis or function fitting: we want to estimate the relationship between the input and output variables. 
Also related is (scattered) data interpolation and Kriging: in all cases we have some data points and we want to find a general function that underlies them.

Most of the time, the actual methods that we use to fit functions to data come from numerical optimization. Our model functions have a given number of degrees of freedom, which give them the flexibility to take different shapes, and optimization is used to find the parameters that bring the model as close as possible to the data (that minimize the error).

Function fitting: 1D->1D
If the function's outputs come from a discrete set instead of being real numbers, supervised learning is also called classification: our function takes an input and emits a class label (1, 2, 3,... or cats, dogs, squirrels,...). Our job is, having seen some examples of this classification at work, to learn a way to do the same job on inputs that are outside the data set provided.

Binary classifier: 2D->label
For unsupervised learning, on the other hand, the data is just made of points in space: we have no labels, no outputs, just a distribution of samples.
As we don't have outputs, fitting a function sounds harder, since functions are relations between inputs and their outputs. What we can do, though, is organize these points to discover relationships among them: maybe they form clusters, or maybe they span a given surface (manifold) in their n-dimensional space.

We can see clustering as a way of classifying data without knowing what the classes are, a-priori. We just notice that certain inputs are similar to each other, and we group these in a cluster. 
Maybe later we can observe the points in the cluster and decide that it's made of cats, assign a label a-posteriori.
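
To make the clustering idea concrete, here is a minimal sketch of k-means, one of many clustering methods (the text above doesn't commit to a specific one). The function name `kmeans`, the toy points and the choice of starting centers are all made up for illustration.

```python
import math

def kmeans(points, centers, iterations=20):
    """Plain k-means: alternate between assigning each point to its nearest
    center and moving each center to the mean of the points assigned to it."""
    centers = [tuple(c) for c in centers]

    def nearest(p):
        # Index of the center closest to p (Euclidean distance).
        return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))

    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            groups[nearest(p)].append(p)
        # A center with no points assigned stays where it is.
        centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, [nearest(p) for p in points]

# Two obvious blobs; no labels are given, only positions.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(points, centers=[points[0], points[3]])
```

The returned `labels` are just cluster indices: deciding that cluster 0 is "cats" is the a-posteriori step described above.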

2D Clustering
Closely related to clustering is dimensionality reduction: if we have points in an n-dimensional space, and we can cluster them in k groups, where k is less than n, then probably we can express each point by saying how close to each group it is (projection), thus using k dimensions instead of n.

2D->1D Projection
Dimensionality reduction is, in turn, closely related to finding manifolds: let's imagine that our data are points in three dimensions, but we observe that they always lie on the unit sphere.
Without losing any information, we can express them as coordinates on the sphere surface (longitude and latitude), thus saving one dimension by noticing that our data lay on a parametric surface.
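
The sphere example can be sketched in a few lines. The function names and the sample point are hypothetical; the only assumption is that the input really lies on the unit sphere.

```python
import math

def to_lon_lat(x, y, z):
    """Map a point on the unit sphere to (longitude, latitude) in radians."""
    # Clamp z to guard against values a hair outside [-1, 1] from rounding.
    return math.atan2(y, x), math.asin(max(-1.0, min(1.0, z)))

def from_lon_lat(lon, lat):
    """Inverse mapping: two coordinates back to a 3D point on the unit sphere."""
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

p = (0.36, 0.48, 0.8)          # 0.36^2 + 0.48^2 + 0.8^2 = 1
lon, lat = to_lon_lat(*p)      # three numbers became two...
q = from_lon_lat(lon, lat)     # ...and we can get the three back
```

Up to floating-point rounding, `q` matches `p`: the third coordinate was redundant once we knew the surface the data lies on.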

And (loosely speaking) every time we can project points to a lower dimension we have in turn found a surface: if we take all the possible coordinates in the lower-dimensionality space, they will map to some points of the higher-dimensionality one, generating a manifold.

Interestingly though unsupervised learning is also related to supervised learning in a way: if we think of our hidden, unknown function as a probability density one, and our data points as samples extracted according to said probability, then unsupervised learning really just wants to find an expression of that generating function. This is also the very definition of density estimation!

Finally, we could say that the two are also related to each other through the lens of dimensionality reduction, which can be seen as nothing else than a way to learn an identity function (inputs map to outputs) under the constraint that the function, internally, has to lose some information: it has to have a bottleneck that ensures the input data is mapped to a small number of parameters.

- Function fitting

Confused yet? Head spinning? Don't worry. Now that we have seen that most of these fields are somewhat related, we can choose just one and look at some examples. 

The idea that most programmers will be most familiar with is function fitting. We have some data, inputs and outputs, and we want to fit a function to it so that for any given input our function has the smallest possible error when compared with the outputs given.

This is commonly the realm of numerical optimization. 

Let's say we suppose our data can be modeled as a line. A line has only two parameters: y=a*x+b. We want to find the values of a and b so that, over all the data points (x1,y1),(x2,y2)...(xN,yN), an error measure is minimized, for example the L2 distance between the function outputs and the data outputs.
This is a very well studied problem: it's called linear regression, and in the way it's posed here it's solvable using linear least squares.
Note: if instead of wanting to minimize the distance between the data output and the function output, we want to minimize the distance between the data points and the line itself, we end up with principal component analysis/singular value decomposition, a very important method for dimensionality reduction - again, all these fields are intertwined!
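
Going back to the plain line fit: for y=a*x+b the least-squares solution has a well-known closed form, sketched below. The function name `fit_line` and the sample data are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b: minimizes the sum of squared
    vertical distances between the line and the data points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is covariance(x, y) / variance(x); the line passes through the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Noisy samples of something close to y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
a, b = fit_line(xs, ys)  # a is about 1.97, b is about 1.06
```

No iteration is needed here, which is exactly why simple models are pleasant to fit.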

Now, you can imagine that if our data is very complicated, approximating it with a line won't really do much; we need more powerful models. Roughly speaking, we can construct more powerful models in two ways: we either use more pieces of something simple, or we start using more complicated pieces.

So, on one extreme we can think of just using linear segments, but using many of them (fitting a piecewise linear curve). On the other, we can think instead of fitting higher-order polynomials, or rational functions, or even of finding an arbitrary function made of any combination of any number of operators (symbolic regression, often done via genetic programming).

Polynomial versus piecewise linear.
The rule of thumb is that simpler models are usually easier to fit (train), but might be wasteful and grow rather large (in terms of the number of parameters). More powerful models might be much harder to fit (global nonlinear optimization), but can be more succinct.

- Neural Networks

For all the mystique there is around Neural Networks and their biological inspiration, the crux of the matter is that they are nothing more than a way to approximate functions, rather like many others, but made from a specific building block: the artificial neuron.

This neuron is conceptually very simple. At heart is a linear function: it takes a number of inputs, it multiplies them with a weight vector, it adds them together into a single number (a dot product!) and then it adds a bias value (optionally).
The only "twist" there is that after the linear part is done, a non-linear function (the activation function) is applied to the results.

If the activation function is a step (outputting one if the result was positive, zero otherwise), we have the simplest kind of neuron and the simplest neural classifier (a binary one, only two classes): the perceptron.

In general, we can use many nonlinear functions as activations, depending on the task at hand.
Regardless of this choice, though, it should be clear that with a single neuron we can't do much. In fact, all we can ever do is express a distance from a hyperplane (again, we're doing a dot product), somewhat modified by the activation. The real power of neural networks comes from the "network" part.
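
A single neuron as described above fits in a few lines. This is only a sketch: the names `neuron`, `step` and `sigmoid`, and the example weights, are chosen for illustration.

```python
import math

def neuron(inputs, weights, bias, activation):
    """Dot product of inputs and weights, plus a bias, then a non-linearity."""
    linear = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(linear)

def step(v):
    # Perceptron activation: one if the result is positive, zero otherwise.
    return 1.0 if v > 0 else 0.0

def sigmoid(v):
    # A smooth alternative to the step.
    return 1.0 / (1.0 + math.exp(-v))

# A perceptron whose hyperplane is the 2D line x + y = 1.
weights, bias = [1.0, 1.0], -1.0
above = neuron([2.0, 2.0], weights, bias, step)     # 1.0: on the positive side
below = neuron([0.0, 0.0], weights, bias, step)     # 0.0: on the negative side
on_it = neuron([0.5, 0.5], weights, bias, sigmoid)  # 0.5: exactly on the line
```

The weights and bias define the hyperplane; the activation only decides how the signed distance from it is reported.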

The idea is again simple: if we have N inputs, we can connect to them M neurons. These neurons will each give one output, so we end up with M outputs, and we can call this structure a neural "layer".
We can then rinse and repeat: the M outputs can be considered as inputs of a second layer of neurons and so on, till we decide enough is enough, and at the final layer we use a number of outputs equal to that of the function we are seeking to approximate (often just one, but nothing prevents us from learning vector-valued functions).

The first layer, connected to our input data, is unimaginatively called the input layer, the last one is called the output layer, and any layer in between is considered a "hidden" layer. Non-deep neural networks often employ a single hidden layer.

We could write down the entire neural network as a single formula; it would end up being nothing more than a nested sequence of matrix multiplies and function applications. This formula has lots of unknowns: the weights we use in the matrix multiplies. The learning process is nothing else than optimization: we find the weights that minimize the error of our neural network on the given data.
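
The "nested sequence of matrix multiplies and function applications" can be spelled out directly. In this sketch the layer sizes, the weights and the ReLU activation are arbitrary picks for illustration, and nothing is trained.

```python
def layer(inputs, weights, biases, activation):
    """One layer: `weights` holds one row per neuron, so this is a
    matrix-vector multiply, plus biases, followed by the activation."""
    return [
        activation(sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]

def network(inputs, layers):
    """Feed the outputs of each layer into the next one."""
    values = inputs
    for weights, biases, activation in layers:
        values = layer(values, weights, biases, activation)
    return values

def relu(v):
    return max(0.0, v)

def identity(v):
    return v

# 2 inputs -> 3 hidden neurons (ReLU) -> 1 linear output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, -0.5, 0.1], relu),
    ([[1.0, 2.0, -1.0]], [0.25], identity),
]
output = network([1.0, 2.0], layers)  # a list with one value, about -0.85
```

Every number inside `layers` is one of the unknowns that training would have to find.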

Because we typically have lots of weights, this is a rather large optimization problem, so fast, local, gradient-descent based optimizers are typically used. The idea is to start with an arbitrary set of weights and then update them by following the partial derivatives of the error function towards a local minimum.

We need the partial derivatives for this process to work. It's impractical to compute them symbolically, so automatic differentiation is used, typically via a process called "backpropagation". Other methods could be used as well, or we can even mix methods, using hand-written symbolic derivatives for the parts where we know how to compute them, and automatic differentiation for the others.
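
As a toy version of this loop, here is gradient descent on the simplest possible model, the line y=a*x+b, with hand-written partial derivatives of the mean squared error. It is a sketch, not backpropagation: the function names, learning rate and step count are arbitrary choices that happen to work for this data.

```python
def mse(a, b, xs, ys):
    """Mean squared error of the line y = a*x + b on the data."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def mse_gradient(a, b, xs, ys):
    """Hand-derived partial derivatives of mse with respect to a and b."""
    n = len(xs)
    da = sum(2.0 * (a * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (a * x + b - y) for x, y in zip(xs, ys)) / n
    return da, db

def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    a, b = 0.0, 0.0  # arbitrary starting weights
    for _ in range(steps):
        da, db = mse_gradient(a, b, xs, ys)
        # Step against the gradient, i.e. downhill on the error surface.
        a -= learning_rate * da
        b -= learning_rate * db
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # exact samples of y = 2x + 1
a, b = gradient_descent(xs, ys)
```

For a real network the update rule is the same; only the gradient computation changes, which is where automatic differentiation comes in.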

Under certain assumptions, it can be shown that a neural network with a single hidden layer is a universal approximator: with a finite (but potentially large) number of neurons, it could approximate any continuous function on compact subsets of n-dimensional real spaces (though we might not be able to train it well...).

23 February, 2017

Tonemapping on HDR displays. ACES to rule ‘em all?

HDR displays are upon us, and I’m sure rendering engineers worldwide are trying to figure out how to best use them. What to do with post-effects? How to do antialiasing? How to deal with particles and UI? What framebuffer formats to use? And so forth.

Well, it appears that in this ocean of new research, some standards are emerging, and one solution that seems to be popular is to use the ACES tone-mapping curve (RRT: Reference Rendering Transform) with an appropriate HDR display curve (ODT: Output Display Transform).

To my dismay though I have to say I’m a bit baffled, and perhaps someone will persuade me otherwise in the future, but I don’t see why ACES would be a solid choice. 

First of all, let’s all be persuaded we indeed need to tone-map our HDR data. Why can’t we just apply exposure and send linear HDR to a TV?
At first, it could seem that this should be a reasonable choice: the PQ encoding curve we use to send the signal to TVs peaks at 10,000 nits, which is not too bad. It could allow us to encode a scene-referred signal and let the TV do the rest (tone-map according to its characteristics).
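
For reference, the PQ curve mentioned here is the one standardized as SMPTE ST 2084, and it can be sketched as below. The constants are the ones published in the standard; the function names `pq_encode` and `pq_decode` are my own.

```python
# SMPTE ST 2084 (PQ) constants.
M1 = 2610 / 16384
M2 = 2523 / 4096 * 128
C1 = 3424 / 4096
C2 = 2413 / 4096 * 32
C3 = 2392 / 4096 * 32
PQ_PEAK_NITS = 10000.0

def pq_encode(nits):
    """Absolute luminance in nits -> PQ signal in [0, 1]."""
    y = max(nits, 0.0) / PQ_PEAK_NITS
    ym = y ** M1
    return ((C1 + C2 * ym) / (1.0 + C3 * ym)) ** M2

def pq_decode(signal):
    """PQ signal in [0, 1] -> absolute luminance in nits."""
    e = signal ** (1.0 / M2)
    y = (max(e - C1, 0.0) / (C2 - C3 * e)) ** (1.0 / M1)
    return PQ_PEAK_NITS * y

peak = pq_encode(10000.0)  # 1.0: the top of the signal range
sdr_white = pq_encode(100.0)  # about 0.508
```

Note how 100 nits already lands around the middle of the signal range: the curve spends its precision where we perceive differences, which is what makes it a compression curve rather than a "look".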

This is not what TVs do, though. Leaving the transform from scene values to display values to the TV would allow for lots of flexibility, but would also give the display too much responsibility over the final look of the image.
So, the way it works instead is that TVs do have some tone-mapping functionality, but they are quite linear till they reach their peak intensity, where they seem to just have a sharp shoulder.

How sharp that shoulder is can depend, as content can also send along meta-data telling what’s the maximum nits it was authored at: for content that matches the TV, in theory no rolloff is needed at all, as the TV will know the signal will never exceed its abilities (in practice though, said abilities change based on lots of factors due to energy limits).

Some TVs will also expose silly controls, like gamma in HDR. It seems that these alter the response curve in the “SDR” range of the output; for now, let's ignore all that.
Regardless of these specifics, it's clear that you’re supposed to bring your values from scene-referred to display-referred, and to decide where you want your mid-gray to be, and how to roll highlights from there. You need tone mapping in HDR.

Ok, so let’s backtrack a second. What’s the goal of a tone-mapping curve? I think it depends, but you might have one or more of the following goals:
  1. To compress dynamic range in order to best use the available bits. A form of mu-law encoding.
  2. To provide a baseline for authoring. Arguably that should be a “naturalistic”, perceptual rendition of an HDR scene, but it might even be something closer to the final image.
  3. To achieve a given final look, on a given display.
HDR screens add a fourth possible objective: creating a curve that makes it possible for artists on SDR monitors to easily validate HDR values.
I'd argue that this is a fool's errand, though, so we won't investigate it. A better and simpler way to author HDR values on SDR monitors is to show out-of-range warnings and to make it easy to see the scene at various exposures, to check that shadows/ambient-diffuse-highlight-emissive are all in the ranges they should be.

How does ACES fit in all this? 

It surely was not designed with compression in mind (unlike, for example, the PQ curve). It might somewhat work, but the RRT is meant to be quite “wide” (both in dynamic range and gamut), because it’s supposed to then be further compressed by the ODT.
Compression really depends on what you care about and how many bits you have, so a one-size-fits-all curve is probably not going to cut it in general.
Moreover, the RRT is not meant to be easily invertible; much simpler compression curves can be applied if the goal is to save bits (e.g. to squish the scene into a range that can then be manipulated with the usual color-grading 3D LUTs we are accustomed to).

It wasn’t designed to be particularly perceptually linear either, preserve colors, preserve brightness: the RRT is modelled after film stock.

So we’re left with the third option, a curve that we can use for final output on a display. Well, that’s arguably one of the most important goals, so if ACES does well there, it would be plenty.

At a glance also, that should be really its strength, thanks to the fact that it couples a first compression curve with a second one, specific to a given display (or rather, a display standard and its EOTF: electro-optical transfer function). But here’s the problem. Is it reasonable to tone-map to given output levels, in this situation?

With the old SDR standards (rec.709 / BT.1886) one could think that there was a standard display and viewing ambient we targeted, and that users and TVs would compensate for specific environments. It would have been a lie, but one could have hand-waved things like that (in practice, I think we never really considered the ambient seriously).

In HDR though this is definitely not true, we know different displays will have different peak nits, and we know that the actual amount of dynamic range will vary from something not too different than SDR, in the worst ambients, to something wide enough that could even cause discomfort if too bright areas are displayed for too long. 
ACES itself has different ODTs based on the intended peak nits of the display (and this also couples with the metadata you can specify together with your content).

All this might work. In practice, today we don’t have displays that exceed 1000 nits, so we could use ACES, do an ODT to 1000 nits and, if we can, even send the appropriate meta-data, leaving all the other eventual adjustments to the TV and its user-facing settings. Should we though?
If we know that the dynamic range varies so much, why would we constrain ourselves to a somewhat complex system that was never made with our specific needs in mind? To me it seems quite a cop-out.

Note: for ACES, targeting a fixed range makes a lot of sense, because once a film is mastered (e.g. onto a Blu-ray) the TM can't change, so all you want to do is make sure that what the director saw on the reference screen (that had a given peak nits) matches the output, and that's all left to the metadata+output devices. In games though, we can change TM based on the specific device/ambient... I'm not questioning the virtues of ACES for movies; the RRT was even clearly devised as something that resembles film, so that the baseline TM would look like something that movie people are accustomed to.

Tone-mapping as display calibration.

I already wasn't a fan of blindly following film-based curves and looks in SDR, and I don’t see why this would be the best for the future either.
Sure, filmic stocks evolved over many years to look nice, but they are constrained to what is achievable with chemicals on a film…

It is true that these film stocks did define a given visual language we are very accustomed to, but we have much more freedom in the digital world today to exploit.
We can preserve colors much better, we control how much glare we want to add, we can do localized tone-mapping and so on. Not to mention that we got so much latitude with color grading that even if a filmic look is desired, it's probably not worth delegating the responsibility of achieving it to the TM curve!

To me it seems that with HDR displays the main function of the final tone-mapping curve should be to adapt to the variability of end displays and viewing environments, while specific “looks” should be achieved via grading, e.g. with the help of compression curves (like s-log) and grading 3d LUTs.

Wouldn’t it be better, for the final display, to have a curve where it’s easy for the end user to tweak the level at which mid-grays will sit, while independently controlling how much to roll off the highlights based on the capabilities of the TV? Maybe even having two different "toes" for OLED vs LCDs...

I think it would be easier and safer to just modify our current tone-mapping curves to give them a control over highlight clipping, while preserving the same look we have in SDR for most of the range.
That might avoid headaches with how much we have to adjust our grading between SDR and HDR targets, while still giving more flexibility when it comes to display/ambient calibration.
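
To make the idea of independent controls concrete, here is a deliberately naive sketch of such a curve. It is entirely hypothetical and not something any engine or standard uses: the function name, the parameters (`mid_gray_nits`, `peak_nits`, `shoulder_start`), the 0.18 scene mid-gray and the exponential shoulder are all assumptions made for illustration, and it has no toe at all.

```python
import math

def tone_map(scene, mid_gray_nits=15.0, peak_nits=1000.0, shoulder_start=0.5):
    """Scene-referred linear value -> display luminance in nits.

    mid_gray_nits:  where scene mid-gray (assumed to be 0.18) should sit.
    peak_nits:      what the display can actually reach.
    shoulder_start: fraction of the peak where the highlight rolloff begins.
    """
    # Exposure: a plain scale that puts scene 0.18 at mid_gray_nits.
    nits = max(scene, 0.0) * (mid_gray_nits / 0.18)
    knee = shoulder_start * peak_nits
    if nits <= knee or knee >= peak_nits:
        # Linear region (or no room for a shoulder: hard clip at the peak).
        return min(nits, peak_nits)
    # Exponential shoulder: slope 1 at the knee, asymptotic to peak_nits.
    headroom = peak_nits - knee
    return knee + headroom * (1.0 - math.exp(-(nits - knee) / headroom))

mid_on_bright_tv = tone_map(0.18, peak_nits=1000.0)  # 15 nits
mid_on_dim_tv = tone_map(0.18, peak_nits=600.0)      # still 15 nits
sun_on_dim_tv = tone_map(1000.0, peak_nits=600.0)    # just under 600 nits
```

The point of the sketch is only the separation of concerns: changing `peak_nits` or `shoulder_start` reshapes the highlights without moving mid-gray, and changing `mid_gray_nits` does the opposite.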

HDR brings some new interesting problems, but so far I don't see ACES solving any of them. To me, the first problem, pragmatically, today, is calibration.
A more interesting but less immediate one is how much HDR changes perception, and how to use it not just as a special effect for brighter highlights, but to really be able to create displays that look more "transparent" (as in: looking through a window).

Is there a solid reason behind ACES, or are we adopting it just because it’s popular, it was made by very knowledgeable people, and we follow?

Because following blindly might not be the worst thing, but in the past it so many times led to huge mistakes that we all committed because we didn’t question what we were doing… Speaking of colors and displays, a few years ago, we were all rendering using non-linear (gamma transformed) values, weren’t we?

18 February, 2017

OT: Ten pragmatic tips regarding pens & notebooks

A follow-up to my previous guide to fountain pens. Game Developers Conference is next week and you might want to take notes. Some practical advice on what I think is the best equipment for taking notes on-the-go...

1) Get a spiral-bound, A5 notebook. 

It's the only kind that not only stays flat when open, but also folds completely in half, and these notebooks have thick cardboard backs, making it easy to hold them one-handed when you don't have a table.

Muji sells relatively inexpensive ones that are of a good quality. Midori is another brand I really like for spiral-bound ones.

Midori "polar bear" spiral notebook.
Muji fountain pen. Lamy safari. Kaweco sport. TWSBI Mini.
Rhodia is a favorite of many, but their spiral-bound notebooks (e.g. the very popular DotPad) have side perforations that allow pages to be removed. Unfortunately, these are very weak and the pages will detach. Not good to carry around, only for temporary notes/scratch.

Stitched (threaded binding) and taped notebooks are the second best. They easily lie flat because only a few pages are stitched together, then these groups are bound together with tape. 
Notebooks held with only staples in the middle are the least flexible. 

2) You might want to prefer more absorbent paper for quick notes on the go.

Usually, fountain pens are used with smooth, non-absorbent paper that helps to avoid bleed-through, feathering, and allows the ink to dry over time, bringing out the eventual shading or sheen.

Unfortunately, this might not be the best for quick note taking (even if I don't mind it at all, the dry times are still quite fast with fine nibs). There are absorbent papers out there that work great with fountain pens; the Midori Cotton notebook is an example.

I also usually buy only notebooks with blank pages, not lined or gridded. That's because I tend to draw lots of diagrams and write smaller than the lines.

A Midori Cotton notebook, threaded binding.
Lamy studio and a Faber Castell Loom.
J.Herbin Perle Noire.
3) The best fountain pen for daily notes is a Vanishing Point (fine nib), bar none.

I have a fair collection of fountain pens, but nothing that touches the Namiki/Pilot Vanishing Points (a.k.a. Capless). They are incredible writers, especially in the smaller point sizes (from medium to extra-fine, which are the ones you'll want for notes).

They are fast and clean, due to the retractable nib, and they don't spill in airplanes either (in my experience; you might still want to travel with pens mostly full/without air bubbles and keep them upright during the trip).

Pilot Capless Decimo
The Capless Decimo and its bigger brother, the Vanishing Point.

4) The best cheap fountain pen is the Muji aluminum pen.

This might be actually hard to find, I got one from a store a year ago but never found another one in my subsequent visits since then.

I have the longer model, not the foldable one (which is somewhat worse). It's very cheap, it writes very well, it's not bulky but it's very solid, it works well in planes and it can easily hold a spare cartridge in the body. 
I also like that it uses a screw-on cap, which is a bit slower to open but will ensure that ink doesn't get suddenly suctioned out of the nib (as some tight push-on caps do, by creating a vacuum).
The only downside is that it's fairly skinny, which might not be too comfortable for some.

Alternatively, a starter pen that is as good (or maybe even better, my only Muji might have been an outlier...) is the Lamy Safari, a solid, no-nonsense German performer. It's a little bit more expensive than the Muji one, but it won't disappoint.
I hear great things about the Pilot Metropolitan as well, but I personally don't own one. Namiki/Pilot is probably, though, my favorite brand.

5) If you write a lot, avoid very thin and short pens.

Once upon a time, I used to love very compact pens, and still today I won't ever buy too bulky ones. But I did notice that very thin pens stress my hand more. Prefer pens with a decent diameter.

Some compact pens: a Visconti Viscontina,
A brass Kaweco Lilliput and a Spalding and Sons Mini.

6) You might want to prefer a waterproof ink.

Inks are most of the fun in playing with fountain pens! Inks that shade, inks with sheen, inks with flakes, pigmented inks, iron-gall inks... all kinds of special effects. 

It's all fun and games until a drop of water hits the page and your precious notes are completely erased... Extremely saturated inks might even smear just with the humidity from the hand!

So, especially if you're on the go, the best ink is a waterproof or at least water-resistant one, and one that flows well while drying fast. 

Often, the more "boring" inks are also the best behaved, like the Montblanc Midnight Blue or any of their permanent inks (black, blue, gray). 

My personal favorite, if I had to pick one, would be the Platinum Carbon Black, it's the best black ink I own, it flows perfectly, it's permanent and looks great.
Unfortunately, it's a bit harder to clean, being a pigmented ink, so I use it only in cheaper pens that I have no problems dismantling (it's a perfect match for the Muji pen).

I tend to prefer cartridge converters in my pens and I usually fill them from bottled ink with a syringe, it's less messy.

8) You won't look back at most of your notes.

Taking notes for me is just part of my thinking and learning process. I like it, and as I don't have a great memory, they work as some sort of insurance.

I have a notebook with me at all times, and every time I finish one, I quickly take photos of all pages with my phone and store them in my dropbox account. Easy.
I still find it much easier to work on paper than with anything else when it comes to notes and diagrams, so much so that I will often just draw things on paper, take a photo with an iPhone and send it to my computer during discussions with co-workers, to illustrate a point.

That said, unless the notes are actual, actionable things I need to follow-up on (e.g. meeting notes, to-do lists, ideas to try, sketches for blog posts), I mostly don't look back at them, and I gather that this is common for most people. So, be aware of that!

9) Stick a few small post-its at the end of your notebook.

I always have a small number of different shaped post-its at the end of my notebook, so if I need to put a temporary note on a page, I can. I also use small removable stickers as bookmarks, often.

Another thing that I sometimes do, is to use both ends of a notebook. One side, e.g. from the front, I use for "permanent" notes, things that I am fairly certain about. Meetings, summaries of things I learn, blog posts and so on. 
The other side, e.g. going from the back, can be used as a scratchpad for random quick drawing, computations and so on, if you don't have a separate notebook for these...

10) Try not to go too crazy.

Fountain pens can instigate a compulsion to start a collection, easily. That might be good or not, depending on your point of view. But never fool yourself into thinking that there is a rational need for expensive pens.

Truth is, fountain pens are already a useless luxury, and lots of expensive ones are not really great writers, certainly not better than some good cheap ones. A pen is a pen.

05 February, 2017

Engineering for squishy bags of meat

The section you can skip

I'm Italian, and I grew up in a family of what you could call middle-class intellectuals. My parents came from proletarian families, and were the first generation to be university educated and not working in the fields. 
A teacher and a doctor, embedded in the life of the city and involved in local politics; we used to often host people over for dinner. And I remember, at the time, being fairly annoyed at how people could be highly regarded for their knowledge and intelligence even if they were limited to the sole study of the humanities, without knowing even the basics of math, or scientific thinking.

Fast forward and today, as a computer science professional, society in my small bubble seems to have undergone a (non-violent) cultural revolution. Today "nerds" are considered "smart", the lack of social skills almost a badge of honor, and things seem to matter only when they relate to quantifiable numbers.

Not always, I have to say: the professional world of smart computer scientists is not as stereotyped as your average Hollywood portrayal. Yet too often we live in a world that is yet again too vertical: programmers, artists, managers, often snarking at the other categories' lack of "skills".

"We wanted to build an elegant, robust, and beautiful product"

I've been wanting to write this post for a while now, and I have various sketches in my notebook from almost a year ago, but recently I found this very well written post-mortem analysis of RethinkDB's failure as a startup. I'm no database expert, but what I found particularly interesting is the section talking about the design principles behind their product.

You want to make a new database product; what is it that you're going to focus on? For RethinkDB, the answer was three key factors: correctness, simplicity of interface and consistency.
Seems pretty reasonable, databases are the central infrastructure for most of today's applications, and who would like to build a billion-dollar product on something that doesn't guarantee correct operation? And definitely, today's startups need to move fast and are willing to adopt whatever new technology that helps to get results fast, so simplicity and consistency could give a competitive advantage.

Of course, it's no surprise, as they were talking candidly about the reasons why they failed, that these were "the wrong metrics of goodness" (their words). What did people want? According to the article, the right metrics would have been: timely arrival, "palpable" speed (benchmarks, marketing), and a good use case. In other words, a product that:

1) Solved a problem (use case)
2) Now (timely arrival)
3) While making people happy (the importance of perceived performance)

This misjudgment led not only to the demise of the startup but also to a lot of frustration, depression, and anger, as what could be seen as inferior products made big impacts in the market.

Clash of worlds

The sin is, of course, the idea that an abstract notion of beauty matters at all. We write software to achieve given results, and these results are, for all but the most theoretical of works, something that somehow has to make people happy. 
Everything else is just a means towards that, a tool that can be used well or not, but not a goal. The goal is always to sell to your market, to do something that people will... like. 
It requires understanding your customers and understanding that we are all people. It's the clash between "features" and "experiences".

When we think about features it's easy to end up in the measurement fallacy. We are doing X, and we are doing X measurably better for certain axes of measurement than another product, thus, we are better. Wait for adoption and success.
This is the peril of living in a vertical bubble of knowledge. People in tech think about tech, we work with it, we create it, and it's easy and to a degree even inevitable to start doing technology for technology's sake.

But the truth is that user experience, workflows, and subjective experiences are about technology, and they are about research and computer science. 
They are just -harder- problems to solve, problems that most often have no good metric to optimize at all, sometimes because we don't know how to measure "soft" qualities, but many times because these qualities are truly hard to isolate. 
Better evolves through "blind" iteration and luck, and it's subject to taste, to culture, to society. Look at all the artifacts around you. Why is a musical instrument better than another one? Or a movie, game, car, pen, table, phone... 

Good design is humanistic in nature: we have to understand people, what they like, what they fear, how they use technology, how much they can adapt to change, what risks they evaluate. Objectively good is not good enough. Good design makes objective sense but also overcomes emotional, social, and environmental obstacles.

Computer Science should be taught together with sociology, psychology, design, interactive arts.


To be fair, though, it's not surprising that people who are deeply invested in a given field don't fare well in others. The tension between vertical knowledge, specialization, and horizontal knowledge that allows us to be well-rounded, is ineluctable.
We are more and more pushed towards specialization because our fields are so deep and complex, and even the most brilliant minds don't have infinite mental resources, so there are always tradeoffs to be made.

Whether you went to a university or not, chances are that to become a specialist you needed a deep, technical immersion in a given field. We isolate ourselves into a bubble, and to a degree, we need to.
The real problem begins when we start thinking that a given bubble is representative of the world, that because given qualities matter inside it, they actually do matter.
When that starts, then we start adopting the wrong metrics to judge the world, and we end up frustrated when the world doesn't respond to them. 

Bubbles are of course a very general problem, even a very scary one when you think that our world is becoming more and more complex, to a degree that is not easily explicable even by experts and not easy to encapsulate in a set of rules. Complexity can push us inwards, to simplistic explanations and measures of goodness that we can become very attached to, even if they are very narrow-sighted.

It doesn't help that we, as humans, seek confirmation more than truth.

Academic research

Take for example the world of academia and scientific research. Researchers are judged on novelty, and what a paper really needs is a measure: given such a measure, one can assert that a given technique is objectively better in a given context.
If something is better, but not in a way that is easily measured, it's not something that can be easily published. This is not per se a terrible problem, it becomes a problem when we start thinking that these measures are all that really matters. 

I work in computer graphics; if I make a new system that generates simplified 3d models, and all the artists that use it think it's revolutionary, it produces a much better output and it's easier to use than anything else, did I make something novel and noteworthy? Surely yes, that's, in fact, all that matters.
But if this notion of how much better it is can't be measured in some kind of metric, it's not something that can be considered novel research in academic terms. This is the nature of applied research, but it also creates a disconnect between academia and the rest of the world: a bubble that we have to be aware of.

When we ignore the existence of the bubble, frustration arises. Industry professionals are frustrated when academic research doesn't end up "working" for their needs. And researchers are frustrated when they see the industry "lagging behind" the theoretical state of the art.


Is it all bleak? To a degree, yes, it is when you realize that the world is not easily reducible to comforting measures that we like. That biases are inevitable, that specialization is inevitable and bubbles are inevitable.
I say that "experience is a variance reduction technique": we become more predictable and entrenched, better at doing something, but we pay a price for it.

What we can certainly do though is to at least be -aware- of these mechanisms. Doubt ourselves. Know that what we think matters might not matter at all, and seek different points of view. Know that what we think is rational, most of the time is just a rationalization. Be aware of the emotional factors in our own choices.

What we can certainly do is to break -a bit- out of our vertical pits of knowledge, and be just curious enough, learn just enough to be able to interface with different people. That can be just the solution we need, we can't know all, and we can't really understand the world without any filter. We can though make sure we are able to talk and engage with people that are different, that invested their verticality in other fields.

In the end, personally, I don't consider my journey towards being truly "smart" complete or even really started. But at least I am a bit aware... I also have the gift of a spouse that beats me down showing me how bad I still am at many aspects of "smartness", and a strong enough ego to take that beating. Small steps.

26 October, 2016

Over-engineering (the root of all evil)


Over-engineering: the premature use of tools, abstractions or technical solutions, resulting in wasted effort and unnecessary complexity.

When is a technique used prematurely? When it doesn't solve a concrete, current problem. It is tempting to define good engineering in terms of simplicity or ease of development, but I think it's actually a slippery slope. Simplicity means different things to different people. 

One could see programming as compression (and indeed it is), but we have to realize that compression, or terseness, per se, is not a goal. The shortest program possible is most often not the simplest for people to work with, and the dangers of compression are evident to anybody that had to go through a scientific education: when I was in university the exams that were by far the most difficult came with the smallest textbooks...

Simplicity also means different things to different people. To someone in charge of low-level optimizations, working with fewer abstractions can be easier than having to dive through many software layers. To a novice, software written in a more idiomatic way for a given language might be much easier than something adapted to be domain specific.

Problems, on the other hand, measurable, concrete issues, are a better framework. 

They are still a soft, context-dependent and team-dependent metric, but trying to identify problems, solutions, and their costs brings design decisions from an aesthetic (and often egocentric) realm to a concrete one.
Note: This doesn't mean that good code is not or shouldn't be "beautiful" or elegant, but these are not goals, they are just byproducts of solving certain problems the code might have.
Also, "measurable" does not mean we need precise numbers attached to our evaluations, in practice, most things can't be better than fuzzy guesses, and that's perfectly fine.

Costs and benefits

Costs are seldom discussed. If a technique, an abstraction, an engineering solution doesn't come with drawbacks, it's probably because either it's not doing much, or because we've not been looking hard enough. 

  • Are we imposing a run-time cost? What about the debug build? Are we making it less usable?
  • Are we increasing build-times, lowering iteration times? 
  • Are we imposing a human cost, in terms of complexity, obfuscation, ability to on-board new engineers?
  • Are we making the debugging experience worse? Profiling?
  • Do our tools support the design well? Are we messing up our source control, static analyzer and so on?
  • Are we decreasing code malleability? Introducing more coupling, dependencies? 
  • Reducing the ability to reason locally about code? Making details that matter in our context hidden at call-site? Making semantics less explicit, or less coupled to a given syntax? Violating invariants and assumptions that our code-base generally employs?
  • Does it work well with the team culture? With code reviews or automated testing or any other engineering practice of the team?
We have to be aware of the trade-offs to discuss an investment. But our tendency to showcase the benefits of our ideas and hide the costs is a real issue in education, in research, in production. It's hardwired in the way we work. It's not (most often) even a matter of malice, it's simply the way we are trained to reason: we seek success and shy away from discussing failure.

I've seen countless times people going on stage, or writing articles and books, honestly trying to describe why given ideas are smart and can work while totally forgetting the pain they experience every day due to them.


And that's why over-engineering truly is the root of all evil. Because it's vicious, it's insidious, and we're not trained at all to recognize it. 

It is possible to go out and buy technical books, maybe go to a university, and learn tens or hundreds of engineering techniques and best practices. On the other hand, there is almost nothing, other than experience and practice, that teaches restraint and actual problem solving.

We know what under-engineering is, we can recognize duplicated code, brittle, and unsafe code, badly structured code. We have terminology, we have methodologies. Testing, refactoring, coverage analysis...

In most of the cases, on the other hand, we are not trained to understand over-engineering at all.

Note: In fact over-engineering is often more "pronounced" in good junior candidates, whose curiosity leads them to learn lots of programming techniques, but who have no experience of their pitfalls and can easily stray from concrete problem solving.

This means that most of the time, when over-engineering happens, it tends to persist: we don't go back from big architectures and big abstractions to simpler systems, we tend to build on top of them. Somewhere along the road, we made a bad investment, with the wrong tradeoffs, but now we're committed to it.

Over-engineering tends to look much more reasonable, more innocent than under-engineering. It's not bad code. It's not ugly code. It's just premature and useless: we don't need it, we're paying a high price for it, but we like it. And we like technology, we like reading about it, keeping ourselves up-to-date, adopting the latest techniques and developments. And at a given point we might even start thinking that we did make the right investment, that the benefits are worth it, especially as we seldom have objective measures of our work and we can always find a rationalization of almost any choice.

I'd say that under-engineering leads to evident technical debt, while over-engineering creates hidden technical debt, which is much more dangerous. 

The key question is "why?". If the answer comes back to a concrete problem with a positive ROI, then you're probably doing it right. If it's some vague other quality like "sharing", "elegance", "simplicity", then it's probably wrong, as these are not end goals.

When in doubt, I find it's better to err on the side of under-engineering, as it tends to be more productive than the opposite, even if it is more reviled.

"Premature optimization is the root of all evil" - Hoare, popularized by Knuth.

I think over-engineering is a super-set of premature optimization. In the seventies, when this quote originated, that was the most common form of this more "fundamental" evil.
Ironically, over the decades this lesson has been so effective that nowadays it actually helps over-engineering, as most engineers read it incorrectly, thinking that in general performance is not a concern early on in a project.

Intermission: some examples

- Let's say we're working on a Windows game made in Visual Studio. Let's say that you are using a Visual Studio solution and it's done badly, it uses absolute paths and requires the source-code and maybe some libraries to be in a specific directory tree on the hard drive. Anybody can tell that's a bad design, and the author might be scorned for such an "unprofessional" choice, but in practice, the problems that it could cause are minimal and can be trivially fixed by any programmer.

On the other hand, let's say we started using, for no good reason, a more complex build system, maybe packages and dependencies based on a fancy new external build tool of the week.

The potential cost of such a choice is huge because chances are that now many of your programmers aren't very familiar with this system; it's bringing no measurable benefits, but now you've obfuscated an important part of your pipeline. Yet, it's very unlikely that such a decision will be derided.

- Sometimes issues are even subtler, because they involve non-obvious trade-offs. A fairly hard-coded system might be painful in terms of malleability; maybe making changes in this subsystem requires editing lots of source files every time, even for trivial operations.

We really don't like that, so we replace this system with a more generic, data-driven one which allows us to do everything live and doesn't even require recompiling code anymore. But say that such a system was fairly "cold", and the changes were actually infrequent. Suppose also that the new system takes a fair amount more code and now our entire build is slower. We ended up optimizing a workflow that was infrequent, but on the downside we slowed down the daily routine of all the programmers on the team...
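As a rough illustration of that trade-off, here is a back-of-envelope sketch in Python. Every number in it (team size, how often the subsystem is edited, how much the build slowed down) is a made-up assumption, and `weekly_minutes` is a hypothetical helper, not a measurement from any real project. The point is only that writing down even fuzzy guesses makes the cost side visible.

```python
# Back-of-envelope estimate. All figures are invented for illustration.

def weekly_minutes(events_per_week: float, minutes_per_event: float, people: int) -> float:
    """Total team time spent per week on a recurring activity."""
    return events_per_week * minutes_per_event * people

# Benefit: the hard-coded subsystem was edited rarely, by a couple of people,
# and each edit used to waste about half an hour.
saved = weekly_minutes(events_per_week=1, minutes_per_event=30, people=2)

# Cost: every programmer now waits half a minute longer on every build.
lost = weekly_minutes(events_per_week=50, minutes_per_event=0.5, people=20)

net = saved - lost
print(f"saved {saved:.0f} min/week, lost {lost:.0f} min/week, net {net:+.0f}")
```

With these invented numbers the "improvement" costs the team far more than it saves. Different guesses could easily flip the sign, which is exactly why the question is worth asking before committing to the change.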

- Let's say you use a class where you could have used a simple function. Maybe you integrate a library, where you could have written a hundred lines of code. You use a templated container library where you could have used a standard array or ad-hoc solutions. You were careless and now your system is becoming more and more coupled at build-time due to type dependencies. 

It's maybe a bit slower at runtime than it could be, or it makes more dynamic allocations than it should, or it's slow in debug builds, and it makes your build time longer while being quite obscure when you actually have to step into this library code.

This is a very concrete example that happens often, yet chances are that none of this will be recognized as a design problem, and we often see complex tools built on top of over-engineered designs to "help" solve their issues. So now you might use "unity builds" and distributed builds to try to remedy the build time issues. You might start using complex memory allocators and memory debuggers to track down what's causing fragmentation and so on and so forth.

Over-engineering invites more over-engineering. There is this idea that a complex system can be made simpler by building more on top of it, which is not very realistic.

Specialization and constraints

I don't have a universal methodology for evaluating return on investment, once the costs and benefits of a given choice are understood. And I think there isn't in general one because this metric is very context sensitive. What I like to invite engineers to do is to think about the problem, be acutely aware of it.

One of the principles I think is useful as guidance is that we operate with a finite working set: we can't pay attention to many things at the same time, so we have to find constraints that help us achieve our objectives. In other words, our objectives guide how we should specialize our project.

For example, in my job I often deal with numerical algorithms, visualization, and data exploration. I might code very similar things in very different environments and very different styles depending on the need. If I'm exploring an idea, I might use Mathematica or Processing. 

In these environments, I really know little about the details of memory allocations and the subtleties of code optimization. And I don't -want- to know. Even just being aware of them would be a distraction, as I would naturally gravitate towards coding efficient algorithms instead of just solving the problem at hand.

Oftentimes my Mathematica code actually leaks memory. I couldn't care less when running an exploratory task overnight on a workstation with 92 GB of RAM. The environment completely shields me from these concerns and this is perfect, it allows me to focus on what matters, in that context. I write some very high-level code, and somehow magic happens.

Sometimes I then have to port these experiments to production C++ code. In that environment, my goals are completely different. Performance is so important to us that I don't want any magic, I want anything that is even remotely expensive to be evident in the location where it happens. If there was some magic that worked decently fast most of the time, you can be sure that the problems it creates would go unnoticed until there are so many locations where it happens that the entire product falls apart.
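To make "evident in the location where it happens" a bit more tangible, here is a minimal sketch. It is written in Python purely for brevity (the same concern applies to implicit copies, conversions and allocations in C++), and `PointCloud`, `centroid` and `compute_centroid` are hypothetical names invented for the example, not code from any real project.

```python
from dataclasses import dataclass, field


def compute_centroid(points):
    """One full pass over the points; the call itself advertises the work."""
    n = len(points)
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    return (sx / n, sy / n)


@dataclass
class PointCloud:
    points: list = field(default_factory=list)

    # "Magic" style: reads like a cheap attribute access,
    # but walks every point each time it is touched.
    @property
    def centroid(self):
        return compute_centroid(self.points)


cloud = PointCloud([(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)])

# Hidden cost: three innocuous-looking reads, three full passes over the data.
hidden = [cloud.centroid for _ in range(3)]

# Explicit cost: one visible call at the call-site, result reused.
c = compute_centroid(cloud.points)
explicit = [c, c, c]
```

Both versions compute the same values. The difference is that in the second one a reader skimming the call-site can see where the work is done, and nothing expensive hides behind syntax that looks free.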

I don't believe that you can create systems that are extremely wide, where you have both extremely high-level concerns and extremely low-level ones, jack-of-all-trades. Constraints and specialization are key to software engineering (and not only), they allow us to focus on what matters, keeping the important concerns in our working set and to perform local reasoning on code.

All levels

Another aspect of over-engineering is that it doesn't just affect minute code design decisions or even just coding. In general, we have a tendency to do things without proper awareness, I think, of what problems they solve for us and what problems they create. Instead, we're often guided either by a certain aesthetic or certain ideals of what's good.

Code sharing for example and de-duplication. Standards and libraries. There are certain things that sometimes we consider intrinsically good, even when we have a history of failures from which we should learn. 

For engineering, sharing in particular is something that comes with an incredible cost but that is almost always considered a virtue per se, even by teams which have experience actually paying the price in terms of integration costs, of productivity costs, of code-bloat and so on, it came to be just considered "natural".

"Don't reinvent the wheel" is very true and sound. But "the wheel" to me means "cold", infrastructural code that is not subject to iteration, that doesn't need specialization for a given project. 

Thinking that sharing and standardization is always a win is like thinking that throwing more people at a problem is always a win, or that making some code multithreaded is always a win, regardless of how much synchronization it requires and how much harder it makes the development process...

In a videogame company, for example, it's certainly silly to have ten different math libraries for ten different projects. But it might very well not be silly to have ten different renderers. Or even twenty for that matter: rendering is part of the creative process, it's part of what we want to specialize, to craft to a given art-direction, given project scope and so on.


Context doesn't matter only on a technical level, but also, or perhaps even more, on a human level. Software engineering is a soft science!

I've been persuaded of this having worked on a few different projects in a few different companies. Sometimes you see a company using the same strategy for similar projects, only to achieve very different results. Some other times, on the other hand, similar results are obtained by different products in different companies by employing radically different, almost opposite strategies. Why is that?

Because people matter more than technology. And this is perhaps the thing that we, as software engineers, are trained the least to recognize. People matter more than technology.

A team of veterans does not work the same as a team that has or needs a lot of turnover. In the game industry, in some teams innovation is spearheaded by engineers, in some others, it's pushed by artists or technical artists.

A given company might want to focus all their investment in a few, very high profile products, where innovation and quality matter a lot. Another might operate by producing more products and trying to see what works, and in that realm maybe keeping costs down matters more.

Even the mantras of sharing and avoiding duplication are not absolute. In some cases, duplication actually allows for better results, e.g. having separate environments for experimentation and for final production. In some cases sharing stifles creativity, and has upkeep costs that overall are higher than the benefits.

It's impossible to talk about engineering without knowing costs, benefits, and context. There is almost never a universally good solution. Problems are specific and local.

Engineering is about solving concrete problems in a specific context, not jumping carelessly on the latest bandwagon.

Our industry, I feel, still has lots to learn.