
r/place vision transformer


For the last 10 days, I've finally been implementing an idea I've had for a long time: an AI that simulates an r/place player.

I may have underestimated some parts, but I learned a lot from this project: the dataset, the models themselves, and the post-processing of the AI output.

I won't go into the details of the code here or give quantitative metrics; I just wanted to yap about it.


The dataset

It was a way bigger challenge than I thought.

The data we got from Reddit is a big CSV file of more than 160 million lines, one per pixel change, with the color, the coordinates, the timestamp, and the (anonymized) user, and it's not in chronological order.
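For reference, here's a minimal way to peek at the dump; the column names assume the layout of the official 2022 export (timestamp, user_id, pixel_color, coordinate), so adjust them if your copy differs:

```python
import pandas as pd

# Read lazily in chunks: the full file is far too big to load at once.
reader = pd.read_csv("2022_place_canvas_history.csv", chunksize=1_000_000)
first_chunk = next(reader)
print(first_chunk.head())          # timestamp, user_id, pixel_color, coordinate
print(len(first_chunk), "rows in the first chunk")
```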

The goal is to take a random idx between 0 and 160 million and reconstruct the canvas at the timestamp of pixel change n° idx.

The naive approach is, at each iteration, to take the idx, read the timestamp of that row, and go through all the rows to reconstruct the canvas from scratch. That is obviously not viable.

So I took a different approach. There are obviously better ones, but this one worked for me:

- sort the 160+ million changes by timestamp,
- split them into partitions of 1 million changes each,
- save a snapshot of the full canvas at the start of every partition,
- store the sorted changes so that, for any (x, y), I can look up its last change before a given timestamp.

All this is implemented in a Rust tool; it processes the data in ~1 h on my machine.

Now, to get the canvas at an idx, I take partition n° idx // 1 million, load its canvas snapshot from the file, and apply, for each (x, y), the last change before the timestamp of the idx.
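A minimal sketch of that lookup, with `load_snapshot` and `changes_in_partition` as hypothetical stand-ins for the precomputed files and queries:

```python
PARTITION_SIZE = 1_000_000  # pixel changes per partition

def canvas_at(idx, load_snapshot, changes_in_partition):
    """Reconstruct the canvas just before pixel change n° idx (sketch).

    `load_snapshot(p)` and `changes_in_partition(p)` are hypothetical helpers:
    the first returns the (H, W) canvas saved at the start of partition p,
    the second returns that partition's changes as (timestamp, x, y, color),
    sorted by timestamp.
    """
    p = idx // PARTITION_SIZE
    canvas = load_snapshot(p).copy()

    changes = list(changes_in_partition(p))
    cutoff = changes[idx % PARTITION_SIZE][0]   # timestamp of change n° idx

    # Replay at most 1 million changes: every pixel ends up with its last
    # change strictly before the cutoff timestamp.
    for ts, x, y, color in changes:
        if ts >= cutoff:
            break
        canvas[y, x] = color
    return canvas
```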

It was way faster: I could run a batch of 192 at ~4 it/s, so roughly 800 canvases/s.

Now we have the input data, but we still need the target data. I have two datasets, one for each of the two models we'll see later.

For the first one, the target is just the next color at the center of a 33x33 canvas patch. For the second one, the target is, for each pixel of the patch, the time until it is next changed, normalized to one hour; anything beyond one hour is clipped to 1.0.

So it's just a matter of different SQL queries to get the target data, which is then processed in the Python dataset class.
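As a rough sketch (the `db` helpers are made-up names, and border handling is ignored to keep it short), the dataset side could look like this:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class PlacePatchDataset(Dataset):
    """Hypothetical sketch: one sample = a canvas patch around a pixel change,
    plus either the next color at its center or a time-to-next-change map."""

    def __init__(self, db, n_changes, patch=33, target="color"):
        self.db = db                # made-up wrapper around the SQL queries
        self.n_changes = n_changes
        self.patch = patch
        self.target = target        # "color" or "time"

    def __len__(self):
        return self.n_changes

    def __getitem__(self, idx):
        x, y = self.db.change_coords(idx)       # hypothetical query
        canvas = self.db.canvas_at(idx)         # reconstruction from above
        half = self.patch // 2
        patch = canvas[y - half:y + half + 1, x - half:x + half + 1]

        if self.target == "color":
            # Target 1: the color that change n° idx actually placed.
            target = torch.tensor(self.db.next_color(idx), dtype=torch.long)
        else:
            # Target 2: per-pixel seconds until the next change, normalized
            # to one hour and clipped to 1.0 beyond that.
            dt = self.db.seconds_to_next_change(idx)   # (patch, patch) array
            target = torch.from_numpy(np.clip(dt / 3600.0, 0.0, 1.0)).float()

        return torch.from_numpy(patch.astype(np.int64)), target
```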


The model(s)

After a lot of tries, I ended up with an architecture combining two models: one that predicts the next color at the center of a patch, and one that predicts, for each pixel of the patch, the time before its next change.

Model architecture

The values for d_ff, d_model, n_heads, and n_layers are taken from the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale".
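For reference, the ViT-Base values from that paper are below; the r/place-specific fields in this little config are assumptions based on the rest of this post, not the exact settings I used:

```python
from dataclasses import dataclass

@dataclass
class ViTConfig:
    # ViT-Base values from "An Image is Worth 16x16 Words" (Table 1);
    # the 85M-parameter model in the results is roughly this size, and the
    # 1.9M-parameter one presumably scales these way down.
    n_layers: int = 12
    d_model: int = 768
    d_ff: int = 3072
    n_heads: int = 12

    # r/place-specific fields, assumed from the rest of this post.
    patch_size: int = 17   # 17x17 pixels -> 289 tokens of context
    n_colors: int = 32     # size of the 2022 r/place palette
```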

I train each model separately and then combine them at inference time. I could maybe post-train them together, but I haven't tried yet.

In fact, I haven't even trained the models for more than 1 epoch on more than 4 percent of the data; I prefer to improve the dataset and the model before putting it in the oven, but I already got some interesting results.

Looking at the inference code, here's how the two models actually work together.
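In simplified pseudocode (a sketch only: the tiling strategy and the function names are my assumptions here, not the actual implementation):

```python
import torch

@torch.no_grad()
def simulate_step(canvas, time_model, color_model, patch=17):
    """One simulated pixel placement (sketch): the time model decides *where*
    the next change happens, the color model decides *what* color it is."""
    H, W = canvas.shape

    # 1) Tile the canvas and predict, for every pixel, the (normalized) time
    #    until its next change. Border pixels that don't fit a tile are skipped.
    time_map = torch.ones(H, W)
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            tile = canvas[y:y + patch, x:x + patch].unsqueeze(0)
            time_map[y:y + patch, x:x + patch] = time_model(tile)[0]

    # 2) The pixel with the smallest predicted time is the next to change.
    flat = torch.argmin(time_map).item()
    cy, cx = divmod(flat, W)

    # 3) The color model predicts the new color from the patch centered there.
    half = patch // 2
    cy = min(max(cy, half), H - half - 1)
    cx = min(max(cx, half), W - half - 1)
    ctx = canvas[cy - half:cy + half + 1, cx - half:cx + half + 1].unsqueeze(0)
    canvas[cy, cx] = color_model(ctx).argmax(dim=-1).item()
    return canvas
```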

This combination allows the system to both decide where to place a pixel and what color that pixel should be. The "time" model is really more about predicting the location of the next change, while the color model determines what that change should be.

Despite the limited training so far, this dual approach seems to capture some of the complex behaviors seen in r/place, like the formation of structures and the spread of colors.

However, there's still a lot of room for improvement, especially in terms of consistency and long-term pattern formation.


The post processing

The problem with this project is that the users are not all the same: each user has a different goal, a different style, and a different place on the canvas. For now, the AI is just a mix of all the users: the bots, the void, the flags, and the art.

It's like reducing the whole canvas to something like 64x64: it would be pure chaos, with not enough space for everyone's ideas.

So for now, I'm just doing a little post-processing on the model output. I tried the following:

- a weight map over the canvas that penalizes unwanted behaviors, like the void spreading too aggressively,
- sometimes choosing an alternative color instead of the model's top choice,
- adding some randomness to the selection.

These post-processing steps aim to introduce variety and control into the model's output, helping to mitigate the "average behavior" problem and simulate more diverse user actions. The weight map helps reduce unwanted behaviors like excessive void creation, while choosing alternative colors and introducing randomness helps create more interesting and varied patterns.
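As an illustration (the weight values, temperature, and probabilities below are placeholders, not my actual settings), the post-processing could look roughly like this:

```python
import torch

def postprocess(time_map, color_logits, weight_map, temperature=1.5,
                alt_color_prob=0.1):
    """Sketch of the post-processing ideas above (all values are placeholders).

    - `weight_map` scales the predicted times to discourage some behaviors,
      e.g. values > 1.0 over black regions so the void spreads more slowly.
    - The color is sampled from a softened distribution instead of argmax,
      and occasionally the second-best color is forced for variety.
    """
    # Where: pick the pixel with the smallest *weighted* time-to-change.
    weighted = time_map * weight_map
    flat = torch.argmin(weighted).item()
    y, x = divmod(flat, time_map.shape[1])

    # What: sample the color instead of always taking the most likely one.
    probs = torch.softmax(color_logits / temperature, dim=-1)
    color = torch.multinomial(probs, 1).item()
    if torch.rand(()) < alt_color_prob:
        color = torch.topk(color_logits, 2).indices[1].item()  # second choice

    return (x, y), color
```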

Despite these efforts, balancing the various competing "user intentions" remains a significant challenge. The current approach is a starting point, but there's still room for more sophisticated strategies to better capture the diverse and sometimes conflicting goals of r/place participants.


The results

1.9 million parameters × 2, 1.5 million training iterations, 128x128 canvas, 17x17 patch (context size 289)

Results

85 million parameters × 2, 3 million training iterations, 256x256 canvas, 17x17 patch (context size 289)

Results


The future

I've got several ideas to improve this project:

There's still much room for improvement; this is only the start, the "GPT-2" of this approach. I'm committed to refining this project further and will share updates once I'm satisfied with the progress.

I welcome any suggestions, collaborations, or discussions about the project. Feel free to reach out if you have ideas or want to contribute.

The code will be available on my GitHub soon, and I'll provide the link here when it's ready.