IlluSign: Illustrating Sign Language Videos by Leveraging the Attention Mechanism

¹Reichman University · ²University of Zurich · ³Tel Aviv University

Abstract

Sign languages are dynamic visual languages that involve hand gestures in combination with non-manual elements such as facial expressions. While video recordings of sign language are commonly used for education and documentation, the dynamic nature of signs can make it challenging to study them in detail, especially for new learners and educators. This work aims to convert sign language video footage into static illustrations that serve as an additional educational resource to complement video content. Such illustrations are usually drawn by an artist and are therefore costly to produce. We propose a method that illustrates sign language videos by leveraging generative models’ ability to understand both the semantic and geometric aspects of images. Our approach focuses on transferring a sketch-like illustration style to video footage of sign language, combining the start and end frames of a sign into a single illustration, and using arrows to highlight the hand’s direction and motion. While many style transfer methods address domain adaptation at varying levels of abstraction, applying a sketch-like style to sign language, especially to hand gestures and facial expressions, poses a significant challenge. To tackle this, we intervene in the denoising process of a diffusion model, injecting style as Keys and Values into high-resolution attention layers and fusing geometric information from the image and its edges as Queries. For the final illustration, we use the attention mechanism to combine the attention weights of the start and end illustrations into a soft blend. Our method offers a cost-effective solution for generating sign language illustrations at inference time, addressing the lack of such resources in educational materials.
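In equation form, a hedged reading of this attention intervention (notation ours, not taken from the paper): with Queries $Q$ blended from image and edge features, and Keys $K_s$ and Values $V_s$ taken from the style image's features,

$$\mathrm{Attention}(Q, K_s, V_s) = \mathrm{softmax}\!\left(\frac{Q K_s^{\top}}{\sqrt{d}}\right) V_s,$$

so the output follows the input's geometry while adopting the style image's appearance.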

Results

Our pipeline transforms video frames into a unified illustration by first converting two input frames into the target illustration style, then overlaying them and adding directional arrows to indicate motion. The rightmost column shows the ground-truth illustration, included for visual comparison only; it is not used by our method.
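As a minimal sketch of the composition step, the snippet below overlays two stylized frames and draws a motion arrow with OpenCV. The per-pixel minimum is a simplifying assumption (in our method the fusion happens inside the diffusion model's attention layers), and the hand coordinates are assumed to come from an external hand detector.

import cv2
import numpy as np

def compose_illustration(start_ill: np.ndarray,
                         end_ill: np.ndarray,
                         hand_start: tuple,
                         hand_end: tuple) -> np.ndarray:
    """Overlay two illustrations (uint8 images on a white background)
    and draw an arrow from the start to the end hand position."""
    # For line drawings on white, a per-pixel minimum keeps the strokes
    # of both frames (the darker pixel wins). This is a stand-in for the
    # attention-based fusion used in the actual method.
    fused = np.minimum(start_ill, end_ill)

    # Highlight the hand's direction of motion with an arrow.
    # hand_start / hand_end are (x, y) pixel coordinates from a detector.
    cv2.arrowedLine(fused, hand_start, hand_end,
                    color=(0, 0, 0), thickness=3, tipLength=0.15)
    return fused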

How does it work?

Our method consists of two main stages. (1) We first invert the input images to latent noise and start the diffusion process from the noise of the edges image. In the highest-resolution attention layers, style-image features are injected as Keys and Values, while the Queries blend features from both the image and its edges. (2) A second diffusion stage fuses features from the start and end images, initialized with the start image's noise. Finally, hand masks and the dissimilar features between the two Query sets are injected into the final Queries to enhance hand appearance and enable soft image blending.
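To make the two stages concrete, here is a hedged PyTorch sketch of the modified attention calls. It is a minimal reading of the mechanism, not the actual implementation: the blend weight alpha and the per-token weight blend are assumed inputs, not values or constructions from the paper.

import torch

def styled_attention(q_img, q_edges, k_style, v_style, alpha=0.5):
    # Stage (1): Keys/Values come from the style image's features; the
    # Query blends image and edge features. alpha is an assumed blend
    # weight, not a value from the paper.
    q = alpha * q_img + (1.0 - alpha) * q_edges
    d = q.shape[-1]
    attn = torch.softmax(q @ k_style.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v_style  # style injected via Keys and Values

def soft_blend_attention(q_start, q_end, k, v, blend):
    # Stage (2): a soft combination of the attention weights computed
    # from the start- and end-image Queries. blend is a per-token weight
    # in [0, 1] (e.g. derived from hand masks and Query dissimilarity);
    # its exact construction here is an assumption.
    d = q_start.shape[-1]
    attn_s = torch.softmax(q_start @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    attn_e = torch.softmax(q_end @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    w = blend.unsqueeze(-1)  # broadcast over the key axis
    attn = w * attn_s + (1.0 - w) * attn_e
    return attn @ v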

Qualitative Comparison

Generalization to other styles

BibTeX

@INPROCEEDINGS{bruner2025illusign,
  author    = {Janna Bruner and Amit Moryossef and Lior Wolf},
  title     = {{IlluSign}: Illustrating Sign Language Videos by Leveraging the Attention Mechanism},
  booktitle = {2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG)}, 
  year      = {2025}
}