Sketching Agent: Reconstructing Sketches with Human-like Concise Strokes Based on a Constrained Markov Decision Process

Gaofeng Liuᵃ, Jian Liuᵇ, Yongqi Shaoᵃ, Xuetong Liᵃ, Hong Huoᵃ, Tao Fangᵃ*
ᵃ Shanghai Jiao Tong University
ᵇ University of Shanghai for Science and Technology

Figure 1. Overview of the Sketching Agent drawing a target sketch. During inference, the agent outputs stroke parameters according to the current start point and the canvas at each time step. The renderer maps the strokes onto the canvas. The arrow indicates that the end point of the current stroke serves as the start point of the next stroke.

Abstract

Sketch reconstruction aims to recreate a target sketch by generating a sequence of vector strokes. Traditional methods often focus solely on the visual similarity of the final drawing while neglecting the stroke generation process, resulting in redundant strokes and disordered sequences. To address this limitation, we propose a sketching agent framework based on a Constrained Markov Decision Process (CMDP). To ensure spatial continuity between adjacent strokes and better approximate the human drawing process, we introduce a hybrid action space for the agent. Furthermore, we carefully design reward and cost functions that guide the agent toward efficient sketch reconstruction with more concise strokes while maintaining visual fidelity. Unlike existing methods that rely on supervised learning, our framework adopts a self-supervised learning paradigm, removing the dependence on paired vector labels. Experimental results on the MNIST and QuickDraw datasets demonstrate significant advantages of our approach across various sketch reconstruction tasks. Ablation studies further validate the effectiveness of our method in reducing the number of strokes and optimizing their order.

Method

In this study, we explore how to mimic human drawing and reconstruct a target sketch with more concise strokes in a better order. Traditional methods concentrate on the reconstruction result and ignore the reconstruction process, which yields strokes that are disordered and redundant. Our Sketching Agent is a novel framework based on a constrained Markov decision process that decomposes the target sketch into concise, continuous vector strokes. Moreover, the agent is trained by self-supervised learning on pixel-level sketches, without the need for paired vector labels.

At the initial stage, the agent starts drawing from the upper-left corner. Rather than drawing strokes directly on the canvas, during inference the agent uses the current canvas and start point to determine the stroke parameters at each step. The renderer converts the one-dimensional stroke parameters into a two-dimensional image, and the end point of the current stroke becomes the start point of the next stroke. Fig. 2 shows several iterations of this process, which ultimately reconstructs the complete sketch. At each step, the agent chooses an action that maximizes the cumulative reward while satisfying the constraints. This encourages the agent to draw longer strokes that cover as much of the original image as possible, thereby simplifying the sketch reconstruction process.
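The per-step inference loop described above can be sketched in a few lines. This is a toy illustration only: `draw_stroke` here is a straight-line rasterizer and the stroke parameters are fixed placeholders, whereas in the actual framework the policy network outputs the parameters and the renderer draws curved strokes.

```python
import numpy as np

def draw_stroke(canvas, start, params):
    """Toy renderer: rasterize a straight line from `start` to the end
    point encoded in `params`. Returns the updated canvas and the
    stroke's end point, which becomes the next stroke's start point."""
    end = params["end"]
    for t in np.linspace(0.0, 1.0, 32):  # samples along the stroke
        y = int(round(start[0] + t * (end[0] - start[0])))
        x = int(round(start[1] + t * (end[1] - start[1])))
        if 0 <= y < canvas.shape[0] and 0 <= x < canvas.shape[1]:
            canvas[y, x] = 1.0
    return canvas, end

canvas = np.zeros((128, 128), dtype=np.float32)
start = (0, 0)  # the agent begins at the upper-left corner
# Placeholder stroke parameters; the real policy predicts these from
# the current canvas and start point at every step.
strokes = [{"end": (40, 40)}, {"end": (40, 90)}]
for params in strokes:
    canvas, start = draw_stroke(canvas, start, params)
```

The key property the example preserves is the chaining of strokes: each call returns the end point, and the loop feeds it back in as the next start point, which is exactly the spatial-continuity constraint the hybrid action space enforces.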

strokes_during_inference

Figure 2. Strokes and canvas at each step of the sketch reconstruction. The top row shows the stroke drawn at each step, while the bottom row shows the strokes rendered on the canvas up to that step. The red and yellow dots mark the start point and end point of each stroke, respectively.

actor_critic_network

Figure 3. The architecture of the actor network and the critic network.

training_framework

Figure 4. The training framework of Sketching Agent.
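The paper's exact reward and cost functions are given in the full text; purely as an illustration of how a CMDP separates the two signals, one can think of the reward as the step-wise gain in overlap with the target and the cost as a per-stroke budget. The functional forms below are assumptions for demonstration, not the paper's definitions.

```python
import numpy as np

def step_reward(target, prev_canvas, canvas):
    """Illustrative reward: reduction in squared pixel error against
    the target after rendering the current stroke (larger is better)."""
    prev_err = np.sum((target - prev_canvas) ** 2)
    curr_err = np.sum((target - canvas) ** 2)
    return float(prev_err - curr_err)

def stroke_cost(num_strokes, budget=40):
    """Illustrative cost: each stroke consumes one unit of a fixed
    budget; the CMDP constrains the cumulative cost to stay within it."""
    return 1.0 if num_strokes <= budget else float("inf")

target = np.ones((4, 4))
prev = np.zeros((4, 4))
canvas = prev.copy()
canvas[0, :] = 1.0  # a stroke covering the top row
r = step_reward(target, prev, canvas)
```

Under this split, maximizing reward subject to the cost constraint naturally favors fewer, longer strokes that each cover a large part of the target, which matches the conciseness behavior the reward and cost design aims for.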

Experiments

We evaluate our Sketching Agent on two datasets: MNIST and QuickDraw. MNIST comprises 70,000 handwritten digits (60,000 for training, 10,000 for testing), each a 28×28 grayscale image. QuickDraw contains 50 million sketches across 345 categories; we randomly sample 50,000 for training, 5,000 for validation, and 5,000 for testing, discarding category labels. All images are resized to 128×128; QuickDraw sketches are further augmented by scaling to 256×256 and extracting four random 128×128 crops. Both the actor and critic networks are trained with Adam (initial learning rates 1×10⁻⁴ and 1×10⁻³ respectively, with a step-decay scheduler), a batch size of 48, and a replay buffer of 40,000. We train for 40,000 episodes on MNIST (at most 10 steps per episode) and 20,000 on QuickDraw (at most 40 steps), with discount factor γ = 0.95. On a single NVIDIA 3090 Ti, training takes roughly 7 h for MNIST and 48 h for QuickDraw. An episode terminates when the step limit is reached or the pen is lifted twice consecutively.
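The hyperparameters above can be collected into a single configuration for reference. The values follow the text; the dictionary layout and the `config` helper are just one convenient way to organize them, not part of the released code.

```python
# Hyperparameters as reported in the Experiments section.
COMMON = {
    "actor_lr": 1e-4,       # Adam, with step decay
    "critic_lr": 1e-3,      # Adam, with step decay
    "batch_size": 48,
    "replay_buffer": 40_000,
    "gamma": 0.95,          # discount factor
    "image_size": 128,
}

PER_DATASET = {
    "mnist":     {"episodes": 40_000, "max_steps": 10},
    "quickdraw": {"episodes": 20_000, "max_steps": 40},
}

def config(dataset):
    """Merge the shared settings with the dataset-specific ones."""
    cfg = dict(COMMON)
    cfg.update(PER_DATASET[dataset])
    return cfg
```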

Results

different_dataset_results

Figure 5. The sketching process and results on different datasets. The three columns, from left to right, show the original image, the contour-drawing result, and the stroke order, respectively.

Vector-Line-Art is tailored for line-art images, while Learning-To-Paint is designed for complex images; we therefore compare against both methods on the QuickDraw and MNIST datasets. Figure 6 shows that, compared to Learning-To-Paint, the Sketching Agent reconstructs sketches with continuous, human-like strokes.

campare_with_LTP

Figure 6. Comparison with Learning-To-Paint on MNIST. Row (a) is the stroke sequence generated by the Learning-To-Paint model, and row (b) is the stroke sequence generated by our model.

Figure 7 demonstrates that our method reconstructs the target image with more succinct continuous strokes while preserving the same visual appearance.

campare_with_Vector-Line-Art

Figure 7. Comparison with Vector-Line-Art. Column (a) represents the target image. Columns (b) and (c) represent the results of our Sketching Agent and Vector-Line-Art, respectively. In columns (b) and (c), from left to right are the drawing results and the moving trajectory.

quantitative_comparison

Table 1. Quantitative comparison on QuickDraw and MNIST.

Ablation Study

Results of the ablation experiments on QuickDraw.

impact_of_status_reward

Figure 9. Impact of Status Reward/Cost on agent policies. The first column represents the target images. The second column is the result with Status Reward. The colored strokes show the movement trajectory during the reconstruction.

impact_of_end_point_reward

Figure 10. Impact of the End Point Reward on agent policy. "Output" is the result with the End Point Reward, and "output*" is the result without it.

ablation_experiment_quantitative_metrics

Figure 11. Comparison of quantitative metrics in ablation experiment.