nicole_bellezza — virtual actress

How to build an artificial actress

Nicole Bellezza is a real person who lent her image for the creation of the AI character in the film. Her photographs were used to train an image generation model with Stable Diffusion / SDNext, producing a virtual actress who retains her likeness. KlingAI animated the generated images; LocalAI synthesised the voice. This page documents the technical process and shows some of the photographs used for training.

The technical process

01

Dataset collection

To train the model, approximately 20–30 photographs of the real person were selected, taken under widely varying lighting conditions, angles, and contexts. Variety is essential: an overly homogeneous dataset produces a rigid embedding, unable to generalise to new poses or lighting situations.
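Dataset balance can be sanity-checked before training with a simple tag tally. The helper below is a hypothetical sketch, not part of the actual pipeline: it counts how many photos carry each lighting/angle tag and flags any tag covering more than half the set, which would signal the homogeneity risk described above.

```python
from collections import Counter

def dataset_balance(photo_tags, max_share=0.5):
    """Tally how many photos carry each tag and flag any tag that
    covers more than max_share of the dataset (a homogeneity risk)."""
    counts = Counter(tag for tags in photo_tags for tag in tags)
    total = len(photo_tags)
    dominant = {t: n for t, n in counts.items() if n / total > max_share}
    return counts, dominant

# Hypothetical tags for a 6-photo subset of a training set
photos = [
    {"daylight", "frontal"}, {"daylight", "profile"},
    {"indoor", "frontal"}, {"backlit", "three-quarter"},
    {"daylight", "low-angle"}, {"indoor", "profile"},
]
counts, dominant = dataset_balance(photos)
# "daylight" covers 3 of 6 photos (exactly half), so no tag is flagged
```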

02

Textual Inversion with SDNext

The Textual Inversion process, run locally with SDNext, trains a new token — in this case the trigger word `nicole_bellezza` — associating it with the visual features extracted from the dataset. The base model (Stable Diffusion) remains unchanged: only the textual embedding space is modified, adding a new 'concept' that the model learns to recognise and reproduce.
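Conceptually, Textual Inversion is gradient descent on a single vector while the network itself stays frozen. The toy loop below, in pure Python, is not the real SDNext trainer: it stands in the frozen model with a fixed feature target, purely to show that only the embedding vector ever changes.

```python
# Toy illustration of Textual Inversion: only the embedding vector is
# trained; the "model" (here, the fixed target features) never changes.
def train_embedding(target, lr=0.1, steps=200):
    v = [0.0] * len(target)              # zero-initialised embedding
    for _ in range(steps):
        # loss = sum((v - target)^2); gradient = 2 * (v - target)
        v = [vi - lr * 2.0 * (vi - ti) for vi, ti in zip(v, target)]
    return v

target = [0.3, -1.2, 0.7, 2.0]           # stands in for the face features
embedding = train_embedding(target)
loss = sum((vi - ti) ** 2 for vi, ti in zip(embedding, target))
# After 200 steps the embedding has converged onto the target features
```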

03

The embedding mechanism

A Textual Inversion embedding is a vector in the latent space of the CLIP text encoder. During training, the model optimises this vector so that, when used as input, it guides the diffusion process towards images consistent with the learned face. The result is a `.pt` file of a few kilobytes that 'contains' Nicole's face.
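The kilobyte file size follows directly from the arithmetic: one float32 vector at the 768-dimension width of the CLIP ViT-L/14 text encoder occupies about 3 KB.

```python
import struct

# A Textual Inversion embedding for CLIP ViT-L/14 is one 768-dim vector.
dims = 768
bytes_per_float32 = struct.calcsize("f")   # 4 bytes
vector_bytes = dims * bytes_per_float32    # 3072 bytes, roughly 3 KB

# Multi-vector embeddings scale linearly: e.g. 4 vectors occupy about
# 12 KB, still tiny next to a multi-gigabyte base model checkpoint.
four_vector_bytes = 4 * vector_bytes
```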

04

Scene generation

Once the embedding was obtained, each scene in the film was generated using the trigger word in the prompt: `portrait of nicole_bellezza, dramatic lighting, dark fantasy, mountain background`. The model produces images consistent with the learned face, adapting it to the required narrative context — from the natural Nicole to the horned masca, through to frames overlaid on the HUD interface.
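The production prompts were written by hand; the hypothetical helper below only illustrates how each scene varies the context tags around the fixed trigger word, reproducing the prompt quoted above.

```python
def build_prompt(trigger, *tags):
    """Compose a generation prompt around the fixed trigger word."""
    return ", ".join([f"portrait of {trigger}", *tags])

prompt = build_prompt("nicole_bellezza",
                      "dramatic lighting", "dark fantasy", "mountain background")
# → "portrait of nicole_bellezza, dramatic lighting, dark fantasy, mountain background"
```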

05

Animation with KlingAI

The static images generated by SDNext were animated with KlingAI, a commercial image-to-video generation system developed by Kuaishou. KlingAI is the only non-open-source component in the entire pipeline: it interpolates frames, adds movement to the eyes, lips, and hair, and produces 5–10 second clips that maintain the consistency of the learned face. It is the step that transforms Nicole from a photographic character into a cinematic one.
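KlingAI's model is proprietary, so the sketch below only illustrates the general idea of in-between frames with a linear crossfade; real image-to-video systems predict motion rather than blend pixels.

```python
def lerp_frames(frame_a, frame_b, t):
    """Linearly interpolate two frames (flat lists of pixel values).
    t=0 returns frame_a, t=1 returns frame_b. This is a conceptual
    stand-in: motion-predicting models do far more than pixel blending."""
    return [a * (1 - t) + b * t for a, b in zip(frame_a, frame_b)]

a = [0.0, 0.0, 1.0]
b = [1.0, 0.0, 0.0]
mid = lerp_frames(a, b, 0.5)   # → [0.5, 0.0, 0.5]
```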

06

Voice with LocalAI + TTS

Nicole's voice was generated with LocalAI, an open-source local inference server compatible with the OpenAI API, using a TTS (Text-to-Speech) model run entirely locally without sending data to external services. The dialogue text — written by the author — is converted to audio with a synthesised female voice, then aligned to the video clips during editing.
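Because LocalAI exposes an OpenAI-compatible API, each dialogue line becomes an ordinary HTTP request. The sketch below builds the JSON body for an OpenAI-style `/v1/audio/speech` endpoint without sending it; the model and voice names are placeholders, not the ones used in production.

```python
import json

def tts_request_body(text, model="tts-model", voice="female-1"):
    """Build the JSON body for an OpenAI-compatible speech endpoint,
    as served locally by LocalAI. Model and voice are placeholders."""
    return json.dumps({"model": model, "input": text, "voice": voice})

body = tts_request_body("The mountain remembers every name.")
payload = json.loads(body)
# payload["input"] holds the dialogue line to synthesise
```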

07

Lip sync

Lip synchronisation was achieved in two complementary ways. For scenes with direct dialogue, KlingAI generates lip movement directly from the TTS audio during the animation phase; for scenes where the speech is in the background or off-screen, manual editing ensures audiovisual consistency without requiring a dedicated lip sync pass. No separate lip sync model was used: the pipeline is deliberately kept minimal.


08

Compositing with DaVinci Resolve AI

The overlay of Nicole's scenes onto the real footage shot in the mountains (Lake Veillet, Aosta Valley) was carried out in DaVinci Resolve using the AI tools integrated in the Fusion module. The process unfolds in three phases:

(1) Magic Mask — DaVinci's neural tracker automatically segments the subject (Nicole) frame by frame, separating her from the SDNext-generated background.

(2) Alpha output — the mask is used as an alpha channel to isolate the character cleanly, without manual rotoscoping.

(3) Compositing in Fusion — Nicole's layer is overlaid on the real footage with differential colour correction to match the light temperature between the two sources (artificial light from AI generation vs. natural light from the high-altitude footage).

The result is the impression of an artificial character physically present in a real landscape.
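Phases (2) and (3) reduce, per pixel, to the standard alpha "over" operation. A minimal single-channel sketch; Resolve performs this internally, with colour correction on top:

```python
def composite_over(fg, bg, alpha):
    """Standard alpha-over compositing, per pixel:
    out = fg * alpha + bg * (1 - alpha).
    fg/bg are flat lists of pixel values, alpha the matching mask."""
    return [f * a + b * (1 - a) for f, b, a in zip(fg, bg, alpha)]

fg    = [0.9, 0.9, 0.9]      # Nicole's layer (AI-generated)
bg    = [0.2, 0.4, 0.6]      # real mountain footage
alpha = [1.0, 0.5, 0.0]      # Magic Mask output: subject / edge / background
out = composite_over(fg, bg, alpha)
# out keeps the foreground where alpha is 1, the background where it is 0,
# and blends the two along the mask edge
```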

$ sdnext --train textual_inversion --dataset ./nicole_photos/ --token nicole_bellezza --steps 3000 --lr 0.005
$ sdnext --generate --prompt "portrait of nicole_bellezza, dark fantasy, mountain, dramatic light" --embedding ./nicole_bellezza.pt
# embedding size: 768 dimensions — CLIP ViT-L/14 text encoder

Training dataset — selection

Some of the photographs of the real person used to train the embedding. The variety of poses, lighting conditions, and contexts is deliberate: the more heterogeneous the dataset, the more stable and generalisable the features the model learns.

Technical stack — licences

The vast majority of the tools used to build Nicole are open source: Stable Diffusion and SDNext are released under AGPL-3.0 / Apache 2.0, LocalAI under MIT, and the TTS models used are distributed under permissive licences. The only exception is KlingAI (Kuaishou), a commercial service used for image-to-video animation and for lip sync in scenes with direct dialogue. The choice to use KlingAI was driven by the absence, at the time of production, of open-source alternatives with comparable quality for realistic face animation.

The photographs shown on this page have been used with the consent of Nicole Bellezza solely for the purpose of training the image generation model used in the film. The images generated by the model are an artistic elaboration and do not represent the real person outside the context of the short film.