My findings on the impact of regularization images & captions in training a subject SDXL Lora with Dreambooth

Aug 10, 2023

I wanted to research the impact of regularization images and captions when training a Lora on a subject in Stable Diffusion XL 1.0 (SDXL 1.0) using Dreambooth. I used SDXL 1.0 with the baked 0.9 VAE throughout this experiment.

There’s a couple of things I wanted to find out with the resulting Lora’s:

How capable is the model able to generate photos of me?
How versatile is the model in generating non-photos of me (drawing, 3D cartoon, …)
What’s the impact of my model when trying to render images which should not contain me? (aka: does my model leak into general classes such as the word “man”)
How long does it take to train a Lora on a subject?

This post is a summary of my findings.

Input images

I have 47 input images, most of them close-up or half body. 3 full body shots.

I described these images with captions like this, using a special keyword w0ut3r. I followed the Captioning Datasets for Training Purposes to create those captions:

w0ut3r, photo of a man, half body shot, from front, depth of field, looking into camera, standing, arms crossed, holding frying pan and kitchen knife, mouth closed, slightly smiling, white t - shirt, red apron, indoors, kitchen stove and knives in background

Regularization images

I’ve gathered 3 different sets regularization images, which I’ll use to train my Lora. The images below are just a sample, I have at least 500 images per set.

**SDXL 1.0 Generated** images, using the **same detailed prompts** (but without the "w0ut3r" keyword)

**SDXL 1.0 Generated** images, using a **same simple prompt** (photo of a man).

Photos I gathered on **unsplash**. I saved the alt descriptions as well to use them later on for captioning

Training plan

There’s quite a couple of combination of inputs (14), regularizations and captions possible, which I mapped out in this table. I gave each combination a version number, which I’ll use later on to refer to them.

Table containing combinations with captions and inputs

First training: v002

I picked configuration v002 as my first attempt. I expected this version to give me good results.

This configuration has the following data:

image folder name 9_w0ut3r man (used 9 repeats).
image folder has detailed captions like w0ut3r, photo of a man [+extra details].
there are regularization images in a folder called 1_man. These images were created with SDXL, with input prompts based upon the contents of my real images.
the regularization images have detailed captions, containing that prompt (e.g photo of a man [+extra details]).

I used the following Kohya Google Colab to do my training: https://github.com/camenduru/kohya_ss-colab

In training the Lora, I was aiming for around 2000 steps, creating 10 snapshots during the process. When logging the training command, you can see how it calculates the number of steps:

INFO     Folder 9_w0ut3r man: 47 images found                   
INFO     Folder 9_w0ut3r man: 423 steps                                                    
INFO     Regularisation images are used...
         Will double the  
         number of steps required...                      
INFO     Total steps: 423                                       
INFO     Train batch size: 4                                    
INFO     Gradient accumulation steps: 1.0                       
INFO     Epoch: 10                                              
INFO     Regulatization factor: 2                               
INFO     max_train_steps (423 / 4 / 1.0 * 10 * 2) = 2115

Here are a few settings I used worth noting:

learning_rate: 0.0004
learning rate scheduler: cosine_with_restarts
training batch size: 4
10 epochs
save every epoch
bf16 precision
added the extra network_train_unet_only training command, as recommended in the Kohya_ss documentation for SDXL. The immidiate impact of this parameter I noticed, is that the resulting file is a bit smaller.

With my settings, the training command itself looked like this:

accelerate launch --num_cpu_threads_per_process=1 "./sdxl_train_network.py" --enable_bucket --pretrained_model_name_or_path="/content/drive/MyDrive/sd/stable-diffusion-webui/models/Stable-diffusion/sd_xl_base_1.0_0.9vae.safetensors" --train_data_dir="/content/drive/MyDrive/sd/_sdxl-training/v002/img" --reg_data_dir="/content/drive/MyDrive/sd/_sdxl-training/v002/reg" --resolution="1024,1024" --output_dir="/content/drive/MyDrive/sd/_sdxl-training/v002/model" --logging_dir="/content/drive/MyDrive/sd/_sdxl-training/v002/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=4e-05 --unet_lr=0.0004 --network_dim=128 --output_name="wouter_xl_1.0_v002" --lr_scheduler_num_cycles="10" --no_half_vae --learning_rate="0.0004" --lr_scheduler="cosine_with_restarts" --train_batch_size="4" --max_train_steps="2115" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --caption_extension="txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --gradient_checkpointing --xformers --bucket_no_upscale --network_train_unet_only

I fired up the command, and after about an hour my Lora was ready, resulting in 10 checkpoints (1 for each epoch). Each checkpoint was about 650MB in size. I created an xy plot, where I plotted each checkpoint with a fixed prompt, where I varied the Lora strength from 0.1 to 1.0. I did not use the SDXL refiner, just the SDXL base model, as my Lora was trained on just the base model.

Prompt: photo of a w0ut3r man <lora:wouter_xl_1.0_v002-000001:0.1>
Seed: 156680208700286
Sampler: dpmpp_2s_ancestral
20 steps
Config scale: 7
1024x1024 resolution
X Prompt S/R: :0.1,:0.2,:0.3,:0.4,:0.5,:0.6,:0.7,:0.8,:0.9,:1.0
Y Prompt S/R: wouter_xl_1.0_v002-000001,wouter_xl_1.0_v002-000002,wouter_xl_1.0_v002-000003,wouter_xl_1.0_v002-000004,wouter_xl_1.0_v002-000005,wouter_xl_1.0_v002-000006,wouter_xl_1.0_v002-000007,wouter_xl_1.0_v002-000008,wouter_xl_1.0_v002-000009,wouter_xl_1.0_v002

This resulted in the following plot:

training result grid of v002

Looking at this plot, I noticed that after about 5 epochs (which was equal to about 1000 steps in my case) the Lora was pretty able to generate a picture of me. However, the pictures at Lora strength one all seemed to have too high contrast and noise.

I tried a few prompts, with the same seed, steps & config scale as the grid above. Ideally, your Lora shouldn’t affect the general class you’re levaraging. In my case, it shouldn’t affect prompts with the just the word man (and not my w0ut3r special keyword).

photo of a w0ut3r man, with my Lora - I'm definately recognizable, but the picture quality is bad: it looks too sharp, with too much contrast

photo of a man, with my Lora - I'm not there (which is good), but the picture is very smoothed out

photo of a w0ut3r man, without my Lora - as SDXL has no clue what `w0ut3r` means, it generates a random person

photo of a man, without my Lora, a different person, but a lot sharper than the smoothened-out version

This was an indication to me that my regularisation images were probably not good enough. I wanted to save some time, and generated those images with the Euler ancestral sampler, 20 steps, without the refiner.

Here’s how one of those regularization images looked like:

regularization image, which was not refined

I think the smoother look of these regularization images was amplified by my training process, making the w0ut3r images sharper, and the man images smoother than they should have been.

So, I generated another batch of regularization images, using the same, detailed prompts, but this time I took some extra time to use the SDXL refiner on them and switched the sampling method to dpmpp_2s_ancestral.

new regularization image, contains more detail

I trained my Lora again, using the same settings as before, but this time with the new regularization images. I called this version v002d. Letter “d”: yes, I tried fiddling with the learning rates and scheduler in 002b and 002c, but that didn’t make any difference.

training result grid of v002

The picture quality was a lot better in this case.

photo of a w0ut3r man, with my Lora - it's sill overtrained, but no more high contrast artifacts or grain

photo of a man, with my Lora - while it still has a smoothened look, it's a lot better than before

Defining some prompts

With a first version being ok, I defined a few prompts I was going to try on the 14 versions of the Lora. Keeping my initial questions (not leaking into main class, versatility and speed) in mind.

example output of the prompts, without a Lora

“photo of a man”
“photo of a w0ut3r man”
“photo of a man, 1 /4 headshot, from side, looking away, mouth closed, smiling”
“photo of a w0ut3r man, 1 /4 headshot, from side, looking away, mouth closed, smiling”
“breathtaking selfie man on a cliff, Fjords in background, award winning, professional”
“breathtaking selfie w0ut3r man on a cliff, Fjords in background, award winning,professional”
“photo of a man in Southern France, wearing a plain red t - shirt and darkblue shorts, Provence, lavender fields in background, hdr”
“photo of a w0ut3r man in Southern France, wearing a plain red t - shirt and darkblue shorts, Provence, lavender fields in background, hdr”
“line art drawing of a man, headshot, professional, sleek, modern, minimalist, graphic line art, vector graphics”
negative: “anime, photorealistic, realism, realistic”
“line art drawing of a w0ut3r man, headshot, professional, sleek, modern, minimalist graphic, line art, vector graphics”
negative: “anime, photorealistic, realism, realistic”
“man looking into the camera by Pixar and Dreamworks, cinematic, muted colors, faded”
negative: “photo, photographic, photorealistic, realism”
“w0ut3r man looking into the camera by Pixar and Dreamworks, cinematic, muted colors faded”
negative: “photo, photographic, photorealistic, realism”

The beauty of SDXL is you need less “special keywords” or negative prompting to get good results.

The other settings for generating the images were:

seed: 156680208700286
steps: 20
cfg: 7
sampler_name: “dpmpp_2s_ancestral”

Generating a ton of images

I needed to generate quite a lot of images, which called for some automation. While doing the initial rendering of the images of v002, I had quite a couple of times A1111 Webui crashed on me, running out of memory after a while. ComfyUI was a lot more stable in my experience, but had less built-in options to generate the variations I needed.

Luckely, it does come in with a built-in api which I could leverage.

I set up a very basic ComfyUI workflow with a Lora in between, exported it as an api json, and set up a python script to send api requests to ComfyUI:

import os
import sys
import requests
import json
from PIL import Image

comfyUIUrl = 'http://127.0.0.1:8188/prompt'

def main():
  # load the workflow json
  apiJSON = json.load(open('workflow_api.json'))

  promptTemplate = {
    'filename_prefix': "v014_selfie",
    'positive': "photo of a w0ut3r man",
    'negative': "",
    'seed': 156680208700286,
    'steps': 20,
    'cfg': 7,
    'sampler_name': "dpmpp_2s_ancestral",
    'lora_name': "wouter_xl_1.0_v014.safetensors",
    'lora_strength_model': 1,
    'lora_strength_clip': 1,
  }

  # load the axis
  # xAxis = json.load(open('data/lora_selfie_x.json'))
  # yAxis = json.load(open('data/lora_strength_y.json'))
  xAxis = [
    {
      "positive": "photo of a man",
      "negative": ""
    },
    {
      "positive": "photo of a w0ut3r man",
      "negative": ""
    },
    {
      "positive": "photo of a man, 1 /4 headshot, from side, looking away, mouth closed, smiling",
      "negative": ""
    },
    {
      "positive": "photo of a w0ut3r man, 1 /4 headshot, from side, looking away, mouth closed, smiling",
      "negative": ""
    },
    {
      "positive": "breathtaking selfie man on a cliff, Fjords in background, award winning, professional",
      "negative": ""
    },
    {
      "positive": "breathtaking selfie w0ut3r man on a cliff, Fjords in background, award winning, professional",
      "negative": ""
    },
    {
      "positive": "photo of a man in Southern France, wearing a plain red t - shirt and dark blue shorts, Provence, lavender fields in background, hdr",
      "negative": ""
    },
    {
      "positive": "photo of a w0ut3r man in Southern France, wearing a plain red t - shirt and dark blue shorts, Provence, lavender fields in background, hdr",
      "negative": ""
    },
    {
      "positive": "line art drawing of a man, headshot, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
      "negative": "anime, photorealistic, realism, realistic"
    },
    {
      "positive": "line art drawing of a w0ut3r man, headshot, professional, sleek, modern, minimalist, graphic, line art, vector graphics",
      "negative": "anime, photorealistic, realism, realistic"
    },
    {
      "positive": "man looking into the camera by Pixar and Dreamworks, cinematic, muted colors, faded",
      "negative": "photo, photographic, photorealistic, realism"
    },
    {
      "positive": "w0ut3r man looking into the camera by Pixar and Dreamworks, cinematic, muted colors, faded",
      "negative": "photo, photographic, photorealistic, realism"
    }
  ]
  yAxis = [
    {
      "lora_name": "wouter_xl_1.0_v014-000001.safetensors"
    },
    {
      "lora_name": "wouter_xl_1.0_v014-000002.safetensors"
    },
    {
      "lora_name": "wouter_xl_1.0_v014-000003.safetensors"
    },
    {
      "lora_name": "wouter_xl_1.0_v014-000004.safetensors"
    },
    {
      "lora_name": "wouter_xl_1.0_v014-000005.safetensors"
    },
    {
      "lora_name": "wouter_xl_1.0_v014-000006.safetensors"
    },
    {
      "lora_name": "wouter_xl_1.0_v014-000007.safetensors"
    },
    {
      "lora_name": "wouter_xl_1.0_v014-000008.safetensors"
    },
    {
      "lora_name": "wouter_xl_1.0_v014-000009.safetensors"
    },
    {
      "lora_name": "wouter_xl_1.0_v014.safetensors"
    }
  ]

  # generate prompts for these two axis
  prompts = []
  for y in yAxis:
    for x in xAxis:
      prompt = promptTemplate.copy()
      prompt.update(x)
      prompt.update(y)
      prompts.append(prompt)
  
  # send the prompts
  for prompt in prompts:
    send_prompt(apiJSON, prompt)

def send_prompt(apiJSON, prompt):
  samplers = get_nodes_by_class_type(apiJSON, 'KSampler')
  for sampler in samplers:
    sampler['inputs']['seed'] = prompt['seed']
    sampler['inputs']['steps'] = prompt['steps']
    sampler['inputs']['cfg'] = prompt['cfg']
    sampler['inputs']['sampler_name'] = prompt['sampler_name']
    positive_prompt_input = apiJSON[sampler['inputs']['positive'][0]]
    negative_prompt_input = apiJSON[sampler['inputs']['negative'][0]]
    if positive_prompt_input:
      positive_prompt_input['inputs']['text'] = prompt['positive']
    if negative_prompt_input:
      negative_prompt_input['inputs']['text'] = prompt['negative']
  lora_nodes = get_nodes_by_class_type(apiJSON, 'LoraLoader')
  for lora_node in lora_nodes:
    lora_node['inputs']['lora_name'] = prompt['lora_name']
    lora_node['inputs']['strength_model'] = prompt['lora_strength_model']
    lora_node['inputs']['strength_clip'] = prompt['lora_strength_clip']

  save_nodes = get_nodes_by_class_type(apiJSON, 'SaveImage')
  for save_node in save_nodes:
    save_node['inputs']['filename_prefix'] = prompt['filename_prefix']
  response = requests.post(comfyUIUrl, data=json.dumps({'prompt': apiJSON}), headers={'Content-Type': 'application/json'})

def get_nodes_by_class_type(apiJSON, classType):
  nodes = []
  for key in apiJSON:
    node = apiJSON[key]
    hasClassTypeProperty = 'class_type' in node
    isClassType = hasClassTypeProperty and node['class_type'] == classType
    if isClassType:
      nodes.append(node)
  return nodes

if __name__ == "__main__":
  main()

By executing python run_batch.py it then executed the api calls to ComfyUI.

I scheduled image generations for:

Lora strength tables: 14 tables where I wanted to plot each of the 10 epochs and the Lora impact from 0.1 to 1.0. Each one of these tables contains 100 images, resulting in 1400 images.
Prompt tables: for each of the 14 Lora’s, at each epoch, plot the 12 prompts at Lora strength 1. Each one of these tables contains 120 images, resulting in 1680 images.

So in total, I needed to generate 3080 images. I didn’t want to spend cloud credits on work I could do offline: I had access to 2 RTX 2060 laptops. It took on average about 70 seconds per image (these laptops weren’t the fastest). In total I had to wait for about 30 hours before all rendering was done. ComfyUI didn’t crash once.

Generating the plots

I wasn’t using webui (with it’s built-in xyz plot script), and I wanted some of the resulting images to end up in different tables. So I wrote a python script to generate plots for me, where I can specify an input folder, column count and labels:

import os
import sys
import math
import argparse
from PIL import Image, ImageFont, ImageDraw, ImageColor, PngImagePlugin

roboto_ttf_file = os.path.join("./", 'Roboto-Regular.ttf')

def main(args):

  image_path = args.image_path
  num_cols = args.num_cols
  column_labels = None
  row_labels = None

  target_column_width = 360

  grid_width = (num_cols + 1) * target_column_width
  font_size = 48

  if (args.column_labels is not None):
    column_width = math.ceil(grid_width / (num_cols + 1))
    column_labels = args.column_labels
  else:
    column_width = math.ceil(grid_width / num_cols)

  # get the jpg or png files in the folder, ignore files starting with a dot
  files = [f for f in os.listdir(image_path) if (f.endswith('.png') or f.endswith('.jpg')) and not f.startswith('.')]

  # sort the files by name, ascending
  files.sort()

  print('found ' + str(len(files)) + ' files')

  # max images, if specified
  if args.max_images is not None:
    max_images = args.max_images
    files = files[:max_images]
    
  num_rows = math.ceil(len(files) / num_cols)

  if (args.row_labels is not None):
    grid_height = (num_rows + 1) * column_width
    row_labels = args.row_labels
  else:
    grid_height = num_rows * column_width

  # create a grid with those images
  grid = Image.new('RGB', (grid_width, grid_height), (255, 255, 255))

  # loop through the files
  for i in range(len(files)):
    # open the image
    image = Image.open(os.path.join(image_path, files[i]))
    # resize the image to the column width
    image = image.resize((column_width, column_width))
    x = (i % num_cols) * column_width
    y = math.floor(i / num_cols) * column_width
    if args.column_labels is not None:
      x += column_width
    if args.row_labels is not None:
      y += column_width
    # paste the image into the grid
    grid.paste(image, (x, y))

  # draw the column labels
  if args.column_labels is not None:
    font = ImageFont.truetype(roboto_ttf_file, font_size)
    draw = ImageDraw.Draw(grid)
    for i in range(len(column_labels)):
      col_x = (i + 1) * column_width
      col_y = 0
      # measure the text size
      text_width = font.getlength(column_labels[i])
      # split the text into words
      words = column_labels[i].split(' ')
      # start with the first word
      text = words[0]
      # loop through the rest of the words
      for j in range(1, len(words)):
        # measure the text size
        text_width = font.getlength(text + ' ' + words[j])
        # if the text is too wide for the column, add a new line
        if text_width > column_width:
          text += '\n' + words[j]
        else:
          text += ' ' + words[j]
      # split the text into lines
      lines = text.split('\n')
      text_height = len(lines) * font_size
      # loop through the lines
      for j in range(len(lines)):
        # measure the text size
        text_width = font.getlength(lines[j])
        # draw the text
        x = col_x + (column_width - text_width) / 2
        y = col_y + (j * font_size) + (column_width - text_height) / 2
        draw.text((x, y), lines[j], font=font, fill=(0, 0, 0))
  
  # draw the row labels
  if args.row_labels is not None:
    font = ImageFont.truetype(roboto_ttf_file, font_size)
    draw = ImageDraw.Draw(grid)
    for i in range(len(row_labels)):
      # measure the text size
      text_size = font.getlength(row_labels[i])
      # draw the text
      x = (column_width - text_size) / 2
      y = (i + 1) * column_width + (column_width - font_size) / 2
      draw.text((x, y), row_labels[i], font=font, fill=(0, 0, 0))

  # save the grid, filename is the same as the folder name, store it one directory above the input image path
  output_filename = os.path.basename(os.path.normpath(image_path)) + '.png'
  output_path = os.path.join(os.path.dirname(image_path), output_filename)
  grid.save(output_path)

def setup_parser() -> argparse.ArgumentParser:
  parser = argparse.ArgumentParser(description='Create a grid of images from a folder of images')
  parser.add_argument('--image_path', type=str, default=None, help='The path to the folder of images')
  parser.add_argument('--num_cols', type=int, default=None, help='The number of image columns in the grid')
  parser.add_argument('--max_images', type=int, default=None, help='The maximum number of images to include in the grid')
  # list of strings for column labels
  parser.add_argument('--column_labels', action='append', type=str, default=None, help='The labels for each column')
  parser.add_argument('--row_labels', action='append', type=str, default=None, help='The labels for each row')
  return parser

if __name__ == "__main__":
  parser = setup_parser()
  args = parser.parse_args()
  main(args)

I could call this script like so:

python create_grid_of_images.py --num_cols=12 --column_labels="photo man" --column_labels="photo w0ut3r man" --column_labels="photo man side" --column_labels="photo w0ut3r man side" --column_labels="selfie man fjord" --column_labels="selfie w0ut3r man fjord" --column_labels="photo man provence" --column_labels="photo w0ut3r man provence" --column_labels="line art drawing man" --column_labels="line art drawing w0ut3r man" --column_labels="man Pixar & Dreamworks" --column_labels="w0ut3r man Pixar & Dreamworks" --row_labels="v001 epoch 5" --row_labels="v002 epoch 7" --row_labels="v003 epoch 7" --row_labels="v004 epoch 4" --row_labels="v005 epoch 4" --row_labels="v006 epoch 10" --row_labels="v007 epoch 6" --row_labels="v008 epoch 6" --row_labels="v009 epoch 6" --row_labels="v010 epoch 9" --row_labels="v011 epoch 5" --row_labels="v012 epoch 6" --row_labels="v013 epoch 6" --row_labels="v014 epoch 9" --image_path='/Volumes/Extreme SSD/sd/blog/grid-good-simple-photo'

Ok, it’s a long command, but with some smart find and replace it didn’t take too much work. It was also a flexible solution for me to generate different grids, with different labels, and reuse some of the images where applicable.

The results

Training leaks into the main class

When generating images without my w0ut3r keyword, do the other people look like me (which they shouldn’t)?

This is the baseline, where no Lora is applied:

Here’s a table plotting epoch 10 of each of the Loras, these prompts do not contain my keyword:

I put the evaluation of the 14 versions in a table, where I could also see the original configuration (reg images, captions, etc…):

Conclusion on leaking

The only good version is v002: detailed captions on both the input and regularization images, the regularization images are generated with the captions from the input image.
I need captioned regularization images for my class not to leak
Captioning my input images has less effect on leaking, but makes my model less reliable to control: while it doesn’t leak, I wasn’t able to create photos of me in different settings.

For the next tests, I’ll ignore the leaking into the main class. I’m going to focus on the quality of the images of the subject.

First epoch with a good, simple photo of the subject

For this benchmark, I wanted to see how fast the Lora learns to generate a simple photo of me. It’s more important that I’m recognizable, than to have a good picture quality. I defined a simple photo as a short prompt, with words that are present in the captions.

“photo of a man”
“photo of a w0ut3r man”
“photo of a man, 1 /4 headshot, from side, looking away, mouth closed, smiling”
“photo of a w0ut3r man, 1 /4 headshot, from side, looking away, mouth closed, smiling”

Conclusion for simple photos

The fasted trained model with good results is where the input is captioned and the regularization images are real photos. (v004 and v005).
Captioning my input images is the best chance of good quality and faster learning than no input captions.
I can get away without regularization images, as long as I caption my input images (v001).
Without input captions, don’t bother with regularization captions.
Only acceptable result without input captions is with real photo regularization images (v012), not with generated images.

First epoch with a good, “complex” photo of the subject

For this benchmark, I wanted to see how fast the Lora learns to generate a complex photo of me. I define a “complex photo” as a prompt, with words that are not present in the captions. It’s more important that I’m recognizable, than to have a good picture quality

“breathtaking selfie man on a cliff, Fjords in background, award winning, professional”
“breathtaking selfie w0ut3r man on a cliff, Fjords in background, award winning, professional”
“photo of a man in Southern France, wearing a plain red t - shirt and darkblue shorts, Provence, lavender fields in background, hdr”
“photo of a w0ut3r man in Southern France, wearing a plain red t - shirt and darkblue shorts, Provence, lavender fields in background, hdr”

Conclusion for complex photos

The findings here, are exactly the same as with the simple photos. Some of the “bad” Lora’s of the simple photos are performing worse here, but the good ones are still good. They’re not slower to train than the simple photos, which surprised me.

The fasted trained model with good results is where the input is captioned and the regularization images are real photos. (v004 and v005).
Captioning my input images is the best chance of good quality and faster learning than no input captions.
I can get away without regularization images, as long as I caption my input images (v001).
Without input captions, don’t bother with regularization captions.
Only acceptable result without input captions is with real photo regularization images (v012), not with generated images.

First epoch with a good line drawing of the subject

The next test is generating something else than a photo. I chose to do a line drawing. For this one, I added a negative prompt as well, to exclude photos in the result.

“line art drawing of a man, headshot, professional, sleek, modern, minimalist, graphic line art, vector graphics”
negative: “anime, photorealistic, realism, realistic”
“line art drawing of a w0ut3r man, headshot, professional, sleek, modern, minimalist graphic, line art, vector graphics”
negative: “anime, photorealistic, realism, realistic”

All of the models were able to generate a line drawing that looks like me. However, some of the models started failing this task in a later epoch.

Conclusion for line drawings

All versions were able to generate a line drawing of me, most of them after 4 epochs.
Without input captions, the line drawing almost always becomes a photo after longer training. The only exception here is v009, where the regularization images were generated with the captions of the input images AND the captions are present.

3D Cartoon of the subject

My final prompt is generating a 3D cartoon version, which roughly tests the same as the line drawing.

“man looking into the camera by Pixar and Dreamworks, cinematic, muted colors, faded”
negative: “photo, photographic, photorealistic, realism”
“w0ut3r man looking into the camera by Pixar and Dreamworks, cinematic, muted colors faded”
negative: “photo, photographic, photorealistic, realism”

This was the hardest one for the models: most of them didn’t look like me at full Lora strength, and some of the models generated photos, even with the negative prompts.

Conclusion for 3D characters

Only the version with input captions, and no regularization images (v001) stays a 3D cartoon in every epoch at full Lora strength.
With input captions, all versions are able to do 3D characters, with the exception of real-photo regularization images without captions.
Without input captions, I need high-quality generated regularization images with captions.

Update: Styling test (August 15, 2023)

I wasn’t entirely satisfied with the styling test (line drawing and 3D render). Especially the 3D render was botched by the fact that I used keywords from my captions (“looking into the camera”) in the beginning of my prompt, which caused the Lora to overdose on the photo style.

I took a look at the Fooocus project and defined a couple of prompts based upon their style prompts:

“line art drawing of a w0ut3r man, headshot, professional, sleek, modern, minimalist, graphic, line art, vector graphics”, negative: “anime, photorealistic, realism, realistic”
“hyper-detailed 3D render of cartoon w0ut3r man in a confident expressive pose, cartoon eyes, caricature, exaggerated facial features, cute cartoon style, white background”, negative: ""
“graffiti style portrait of a w0ut3r man . street art, vibrant, urban, detailed, tag, mural”, negative: “ugly, deformed, noisy, blurry, low contrast, realism, photorealistic”
“impressionist painting of a w0ut3r man . loose brushwork, vibrant color, light and shadow play, captures feeling over form”, negative: “anime, photorealistic, 35mm film, deformed, glitch, low contrast, noisy”
“Pop Art style portrait of a w0ut3r man . bright colors, bold outlines, popular culture themes, ironic or kitsch”, negative: “ugly, deformed, noisy, blurry, low contrast, realism, photorealistic, minimalist”
“Renaissance style portrait painting of a w0ut3r man . realistic, perspective, light and shadow, religious or mythological themes, highly detailed”, negative: “ugly, deformed, noisy, blurry, low contrast, modernist, minimalist, abstract”
“watercolor painting portrait of w0ut3r man . vibrant, beautiful, painterly, detailed, textural, artistic”, negative: “anime, photorealistic, 35mm film, deformed, glitch, low contrast, noisy”
“GTA-style artwork portrait of a w0ut3r man . satirical, exaggerated, pop art style, vibrant colors, iconic characters, action-packed”, negative: “realistic, black and white, low contrast, impressionist, cubist, noisy, blurry, deformed”
“retro arcade style portrait of a w0ut3r man . 8-bit, pixelated, vibrant, classic video game, old school gaming, reminiscent of 80s and 90s arcade games”, negative: “modern, ultra-high resolution, photorealistic, 3D”
“stained glass style portrait of a w0ut3r man . vibrant, beautiful, translucent, intricate, detailed”, negative: “ugly, deformed, noisy, blurry, low contrast, realism, photorealistic”

The results of these prompts without any Lora applied look like this:

I did extra renders of Lora versions v001, v002d, v004, v005 and v012. These were the Lora’s that were able to generate high quality photos.

It made no sense to spend extra time on the other Lora’s, as they were not able to generate good photos based upon photo input.

Styles v001

table with renders of v001 — style prompts v001 - epoch 003 is earliest version where it picks up my look accross styles. From epoch 008 on, the renaissance painting becomes a photo.

Styles v002d

table with renders of v002d — style prompts v002d - epoch 007 is where it starts to pick up my look accross styles, although it is still struggling with the pop-art version, which it only starts nailingin epoch 010. Maybe this network needs even longer training?

Styles v004

table with renders of v004 — style prompts v004 - epoch 004 is ok, apart from the pop-art. The stained glass look starts getting lost in further epochs

Styles 005

table with renders of v005 — style prompts v005 - the 8-bit and stained glass renders become a photo quite early, the line drawing as well in epochs 009 and 010 - looks like this one is worse off than 004 (which has captions for the regularization images). The least-bad version here is epoch 004 for me.

Style v012

table with renders of v012 — style prompts v012 - this one is even worse - photos (almost) accross the board starting from epoch 4. No real pick here...

Conclusions

General

The highest chance of success, is with input captions.
If I don’t care about leaking into the main class, don’t bother with regularization images. The Lora is most flexible at that point
The only way my special keyword w0ut3r didn’t leak, is when I generated high-quality regularization images using the same captions (without the keyword) as the input images.

That’s all folks

I hope this post gave you some insights in what approaches you can take to training your own subject Lora using Dreambooth. That being said, I’m not an expert in this field, and I’m sure there are other ways to get good results. If you have any questions, feel free to leave a comment below.

wouter riding a unicorn — prompt: w0ut3r man riding a rainbow unicorn, cinematic, dramatic - lora v001, epoch 6