Sketching Agent: Reconstructing Sketches with Human-like Concise Strokes Based on a Constrained Markov Decision Process

Gaofeng Liuᵃ, Jian Liuᵇ, Yongqi Shaoᵃ, Xuetong Liᵃ, Hong Huoᵃ, Tao Fangᵃ*
ᵃ Shanghai Jiao Tong University
ᵇ University of Shanghai for Science and Technology

Figure 1. Overview of the Sketching Agent drawing a target sketch. During inference, the agent outputs stroke parameters according to the current start point and the canvas at each time step. The renderer maps the strokes onto the canvas. The arrow indicates that the end point of the current stroke serves as the start point of the next stroke.

Abstract

Sketch reconstruction aims to recreate a target sketch by generating a sequence of vector strokes. Traditional methods often focus solely on the visual similarity of the final drawing while neglecting the stroke generation process, resulting in redundant strokes and disordered sequences. To address this limitation, we propose a sketching agent framework based on a Constrained Markov Decision Process (CMDP). To ensure spatial continuity between adjacent strokes and better approximate the human drawing process, we introduce a hybrid action space for the agent. Furthermore, we carefully design reward and cost functions that guide the agent toward efficient sketch reconstruction with more concise strokes while maintaining visual fidelity. Unlike existing methods that rely on supervised learning, our framework adopts a self-supervised learning paradigm, removing the dependence on paired vector labels. Experimental results on the MNIST and QuickDraw datasets demonstrate significant advantages of our approach across various sketch reconstruction tasks. Ablation studies further validate the effectiveness of our method in reducing the number of strokes and optimizing their order.

Method

In this study, we explore how to mimic human drawing and reconstruct a target sketch with more concise strokes in a better order. Traditional methods concentrate on the reconstruction result and ignore the reconstruction process, which yields strokes that are disordered and redundant. Our Sketching Agent is a novel framework based on a constrained Markov decision process that decomposes the target sketch into concise, continuous vector strokes. Moreover, the agent is trained by self-supervised learning on pixel-level sketches, without the need for paired vector labels.

At the initial stage, the agent starts drawing from the upper-left corner. Rather than drawing strokes directly on the canvas, during inference the agent uses the current canvas and start point to determine the stroke parameters at each step. The renderer converts the one-dimensional stroke parameters into a two-dimensional image, and the end point of the current stroke becomes the start point of the next stroke. Fig. 2 shows several iterations of this process, which ultimately reconstructs the complete sketch. At each step, the agent chooses an action that maximizes the cumulative reward while satisfying the constraints. This encourages the agent to draw longer strokes that cover as much of the original image as possible, thereby simplifying the sketch reconstruction process.
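The per-step inference loop described above can be sketched in a few lines. This is a toy illustration only: `draw_stroke` here is a straight-line rasterizer and the stroke parameters are fixed placeholders, whereas in the actual framework the policy network outputs the parameters and the renderer draws curved strokes.

```python
import numpy as np

def draw_stroke(canvas, start, params):
    """Toy renderer: rasterize a straight line from `start` to the end
    point encoded in `params`. Returns the updated canvas and the
    stroke's end point, which becomes the next stroke's start point."""
    end = params["end"]
    for t in np.linspace(0.0, 1.0, 32):  # samples along the stroke
        y = int(round(start[0] + t * (end[0] - start[0])))
        x = int(round(start[1] + t * (end[1] - start[1])))
        if 0 <= y < canvas.shape[0] and 0 <= x < canvas.shape[1]:
            canvas[y, x] = 1.0
    return canvas, end

canvas = np.zeros((128, 128), dtype=np.float32)
start = (0, 0)  # the agent begins at the upper-left corner
# Placeholder stroke parameters; the real policy predicts these from
# the current canvas and start point at every step.
strokes = [{"end": (40, 40)}, {"end": (40, 90)}]
for params in strokes:
    canvas, start = draw_stroke(canvas, start, params)
```

The key property the example preserves is the chaining of strokes: each call returns the end point, and the loop feeds it back in as the next start point, which is exactly the spatial-continuity constraint the hybrid action space enforces.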

strokes_during_inference

Figure 2. Strokes and canvas at each step of the sketch reconstruction. The top row shows the stroke drawn at each step, while the bottom row shows the strokes rendered on the canvas up to that step. The red and yellow dots mark the start point and end point of each stroke, respectively.

actor_critic_network

Figure 3. The architecture of the actor network and the critic network.

training_framework

Figure 4. The training framework of Sketching Agent.
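The paper's exact reward and cost functions are given in the full text; purely as an illustration of how a CMDP separates the two signals, one can think of the reward as the step-wise gain in overlap with the target and the cost as a per-stroke budget. The functional forms below are assumptions for demonstration, not the paper's definitions.

```python
import numpy as np

def step_reward(target, prev_canvas, canvas):
    """Illustrative reward: reduction in squared pixel error against
    the target after rendering the current stroke (larger is better)."""
    prev_err = np.sum((target - prev_canvas) ** 2)
    curr_err = np.sum((target - canvas) ** 2)
    return float(prev_err - curr_err)

def stroke_cost(num_strokes, budget=40):
    """Illustrative cost: each stroke consumes one unit of a fixed
    budget; the CMDP constrains the cumulative cost to stay within it."""
    return 1.0 if num_strokes <= budget else float("inf")

target = np.ones((4, 4))
prev = np.zeros((4, 4))
canvas = prev.copy()
canvas[0, :] = 1.0  # a stroke covering the top row
r = step_reward(target, prev, canvas)
```

Under this split, maximizing reward subject to the cost constraint naturally favors fewer, longer strokes that each cover a large part of the target, which matches the conciseness behavior the reward and cost design aims for.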

Experiments

We evaluate our Sketching Agent on two datasets: MNIST and QuickDraw. MNIST comprises 70,000 handwritten digits (60,000 for training, 10,000 for testing), each a 28×28 grayscale image. QuickDraw contains 50 million sketches across 345 categories; we randomly sample 50,000 for training, 5,000 for validation, and 5,000 for testing, discarding category labels. All images are resized to 128×128; QuickDraw sketches are further augmented by scaling to 256×256 and extracting four random 128×128 crops. Both the actor and critic networks are trained with Adam (initial learning rates 1×10⁻⁴ and 1×10⁻³ respectively, with a step-decay scheduler), a batch size of 48, and a replay buffer of 40,000. We train for 40,000 episodes on MNIST (at most 10 steps per episode) and 20,000 on QuickDraw (at most 40 steps), with discount factor γ = 0.95. On a single NVIDIA 3090 Ti, training takes roughly 7 h for MNIST and 48 h for QuickDraw. An episode terminates when the step limit is reached or the pen is lifted twice consecutively.
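The hyperparameters above can be collected into a single configuration for reference. The values follow the text; the dictionary layout and the `config` helper are just one convenient way to organize them, not part of the released code.

```python
# Hyperparameters as reported in the Experiments section.
COMMON = {
    "actor_lr": 1e-4,       # Adam, with step decay
    "critic_lr": 1e-3,      # Adam, with step decay
    "batch_size": 48,
    "replay_buffer": 40_000,
    "gamma": 0.95,          # discount factor
    "image_size": 128,
}

PER_DATASET = {
    "mnist":     {"episodes": 40_000, "max_steps": 10},
    "quickdraw": {"episodes": 20_000, "max_steps": 40},
}

def config(dataset):
    """Merge the shared settings with the dataset-specific ones."""
    cfg = dict(COMMON)
    cfg.update(PER_DATASET[dataset])
    return cfg
```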

Results

different_dataset_results

Figure 5. The sketching process and results on different datasets. The three columns, from left to right, show the original image, the contour-drawing result, and the stroke order, respectively.

Vector-Line-Art is tailored for line-art images, while Learning-To-Paint is designed for complex images; we therefore compare against both methods on the QuickDraw and MNIST datasets. Figure 6 shows that, compared to Learning-To-Paint, the Sketching Agent reconstructs sketches with continuous, human-like strokes.

campare_with_LTP

Figure 6. Comparison with Learning-To-Paint on MNIST. Row (a) is the stroke sequence generated by the Learning-To-Paint model, and row (b) is the stroke sequence generated by our model.

Figure 7 demonstrates that our method reconstructs the target image with more succinct continuous strokes while preserving the same visual appearance.

campare_with_Vector-Line-Art

Figure 7. Comparison with Vector-Line-Art. Column (a) represents the target image. Columns (b) and (c) represent the results of our Sketching Agent and Vector-Line-Art, respectively. In columns (b) and (c), from left to right are the drawing results and the moving trajectory.

quantitative_comparison

Table 1. Quantitative comparison on QuickDraw and MNIST.

Ablation Study

Results of the ablation experiments on QuickDraw.

impact_of_status_reward

Figure 9. Impact of Status Reward/Cost on agent policies. The first column represents the target images. The second column is the result with Status Reward. The colored strokes show the movement trajectory during the reconstruction.

impact_of_end_point_reward

Figure 10. Impact of the End Point Reward on agent policy. "Output" is the result with the End Point Reward, and "output*" is the result without it.

ablation_experiment_quantitative_metrics

Figure 11. Comparison of quantitative metrics in ablation experiment.