A challenge for “neural pictures” is the amount of memory required for rendering even a small photo-like picture. Even the largest GPUs and TPUs available at Google may still only produce a 2K image.
An alternative approach is to use vector graphics, that is, images not defined as easily-managed rows of pixels but as abstract lines and curves and fillable regions, as in software like Adobe Illustrator or Inkscape, or formats like path-based SVG. A sharp-edged circle renders as a sharp-edged circle regardless of the final print or JPG size.
Which is what I’ve been doing, and what this post is about.
The picture above is a rather scruffy and deliberately spare sketch guided by CLIP and based on the prompt phrase: “motorcycle leaning hard into a racing turn.” It’s made from vector strokes, rather than raster pixels.
Maybe you can see it, maybe not. Intriguing that the system can come up with something using just a few lines. One of my explorations is trying to see just how few strokes can work, with some subjects.
At the heart of such an approach is differentiable vector graphics, which in general hasn’t existed. Neighboring pixels in a raster image are differentiable – that is, you can look at a pixel and its neighbor and decide “what’s the difference?” – which is brighter or darker, and by how much? For random collections of curves, lines, and shapes, without an orderly grid… no. Every color and edge could be anywhere, independent of the others. Which is a challenge for analysis, because those relationships are key for deep learning and other image operations.
Enter a new, mixed approach, published at last December’s SIGGRAPH Asia: “Differentiable Vector Graphics Rasterization for Editing and Learning.”
This new renderer (“diffvg”) was out only a few months before being connected in Spring 2021 to OpenAI’s just-published CLIP. The combo was published by Kevin Frans of MIT, with collaborators, as a Colab notebook named CLIPDraw. You can try it yourself without reading the associated paper, but you really should read it – the paper’s full of good observations about the nature of current image datasets.
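To make “differentiable” a little more concrete, here is a toy sketch in plain Python. It is not diffvg’s algorithm, just an illustration of the general idea: if a shape’s edge is rendered softly (anti-aliased), then nudging a shape parameter changes pixel values smoothly, and that change can be followed downhill. Everything here is an assumption made for illustration: the function names, the sigmoid edge, the 16-pixel canvas, and the use of finite differences where a real system would use automatic differentiation.

```python
import math

def coverage(px, py, cx, cy, r, softness=1.0):
    """Soft coverage of a pixel centre by a disc: near 1 inside, near 0 outside."""
    d = math.hypot(px - cx, py - cy) - r  # signed distance to the disc's edge
    return 1.0 / (1.0 + math.exp(d / softness))

def render(r, size=16, cx=8.0, cy=8.0):
    """Rasterize a soft-edged disc of radius r onto a size x size grid."""
    return [[coverage(x + 0.5, y + 0.5, cx, cy, r) for x in range(size)]
            for y in range(size)]

def loss(r, target):
    """Sum of squared pixel differences between the rendered disc and a target."""
    img = render(r)
    return sum((a - b) ** 2
               for row_a, row_b in zip(img, target)
               for a, b in zip(row_a, row_b))

def fit_radius(target, r=2.0, lr=0.05, steps=200, eps=1e-4):
    """Gradient descent on the radius, using a finite-difference gradient."""
    for _ in range(steps):
        grad = (loss(r + eps, target) - loss(r - eps, target)) / (2 * eps)
        r -= lr * grad
    return r

target = render(5.0)          # the "picture" we want to match
fitted = fit_radius(target)   # start from a disc that is too small
print(round(fitted, 2))       # should land near 5.0
```

With a perfectly hard edge, a tiny change in radius would flip no pixels at all, so the gradient would be zero almost everywhere and there would be nothing to follow. In CLIPDraw the pixel-matching loss is replaced by CLIP’s opinion of how well the picture matches the prompt, and the parameters are stroke control points, widths, and colors rather than a single radius.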
It’s been straightforward to add SVG outputs to CLIPDraw, as well as to constrain it with image priors, color palettes, etc. From there I can print at 20K pixels or even higher.
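As a rough idea of what an SVG export involves, here is a minimal sketch. It does not use CLIPDraw’s or diffvg’s own data structures: the stroke dictionaries, the `strokes_to_svg` name, and the 224-unit canvas are all hypothetical stand-ins. Each stroke is assumed to be a single cubic Bézier curve with a width and an RGBA color in the 0 to 1 range.

```python
def strokes_to_svg(strokes, canvas=224, out_px=20000):
    """Write strokes as SVG paths.

    strokes: list of dicts (a hypothetical format) with
      'points': four (x, y) control points of one cubic Bezier,
      'width':  stroke width in canvas units,
      'rgba':   four floats in the range 0..1.
    The viewBox keeps the original canvas coordinates, while width and
    height set the output size, so the curves scale without resampling.
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'viewBox="0 0 {canvas} {canvas}" width="{out_px}" height="{out_px}">'
    ]
    for s in strokes:
        (x0, y0), (x1, y1), (x2, y2), (x3, y3) = s["points"]
        r, g, b, a = s["rgba"]
        rgb = f"rgb({round(r * 255)},{round(g * 255)},{round(b * 255)})"
        parts.append(
            f'<path d="M {x0} {y0} C {x1} {y1}, {x2} {y2}, {x3} {y3}" '
            f'fill="none" stroke="{rgb}" stroke-opacity="{a}" '
            f'stroke-width="{s["width"]}" stroke-linecap="round"/>'
        )
    parts.append("</svg>")
    return "\n".join(parts)

# One made-up stroke, just to show the call.
demo = [{"points": [(20, 180), (60, 40), (160, 40), (200, 180)],
         "width": 2.5, "rgba": (0.1, 0.1, 0.1, 0.9)}]
svg_text = strokes_to_svg(demo)
```

Changing `out_px` changes only the two size attributes on the root element; the path data stays the same, which is why the same file can be printed at 2K or 20K.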
I’m hardly the first to look at extending CLIPDraw (that’s kind of the point in a Colab notebook): this post by @RiversHaveWings is an exceptional example using much denser collections of strokes.
A curious behavior, if you watch such a dense painting develop, is that CLIPDraw learns line placement very separately from line color. In fact it often hardly moves the strokes at all, preferring to hash-around colors to get a picture to match its desired goal.
You can see it in three stages of development of some rather biscuity-looking clouds for the painting below. I’ve added arrows showing two arcs in a very early stage: one thick bright curve, one dark squiggle below it. In later stages of the same image, those two strokes are still there – almost unchanged save for their color.
This is very unlike a human artist, especially when building drawings from very thin lines (think: pencil or ballpoint pen, like the motorcycle drawing). The computer sometimes “gets” that collections of strokes can be used to build up color in cross-hatching and shading – but not always.
You can also see a habit the system has of “signing” paintings in the corners – often writing out English words from the prompt, or even occasionally surprises like “Vincent.” Most image databases, scraped from the internet, can contain bits of text, and the computer happily learns to add text-like shapes in the typical locations. This one might have a hint of the Artstation logo? Extra constraints need to be added to suppress the painter’s ego, you might say.
Further experiments yet to come!