ControlNet on diffusers
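Reference: https://huggingface.co/docs/diffusers/using-diffusers/controlnet (v0.24.0)
ControlNet controls the output of a diffusion model by feeding it an additional input image as a condition. This conditioning image can take many forms, such as Canny edges, a user sketch, a human pose, or a depth map. This is very useful: we finally get finer control over the generated image, without repeatedly tweaking the text prompt or parameters such as the number of denoising steps and hoping for a lucky draw.
A ControlNet model holds two sets of weights (also called blocks), connected by a zero convolution layer:
- a locked copy, whose weights stay fixed so that the capabilities of the pretrained diffusion model are not lost
- a trainable copy, whose weights are trained on the additional conditioning input
Because the locked copy preserves the pretrained model, training a ControlNet on a new condition is as fast as fine-tuning any other model, since nothing has to be trained from scratch.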
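The role of the zero convolution can be illustrated with a toy NumPy sketch. Everything in it is an assumption made for illustration: the "blocks" are plain matrix multiplications rather than real U-Net blocks, and the names `frozen_block` and `controlled_block` are hypothetical. It only shows that, because the connecting layers start at zero, the combined output initially equals the output of the locked model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for network blocks: plain linear maps (an assumption, not real U-Net blocks).
w_frozen = rng.normal(size=(4, 4))   # locked copy of the pretrained weights
w_trainable = w_frozen.copy()        # trainable copy, initialised from the same weights
w_zero_in = np.zeros((4, 4))         # zero-initialised layer applied to the condition
w_zero_out = np.zeros((4, 4))        # zero-initialised layer applied to the trainable branch


def frozen_block(x):
    return x @ w_frozen


def controlled_block(x, condition):
    # The condition enters the trainable copy through a zero-initialised layer ...
    hidden = (x + condition @ w_zero_in) @ w_trainable
    # ... and the trainable branch is added back through another zero-initialised layer.
    return frozen_block(x) + hidden @ w_zero_out


x = rng.normal(size=(2, 4))
condition = rng.normal(size=(2, 4))

# Before any training step the zero layers contribute nothing.
print(np.allclose(controlled_block(x, condition), frozen_block(x)))  # True
```

Once training updates the zero-initialised weights, the trainable branch starts to contribute, while the locked weights never change.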
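This article shows how to use ControlNet for text-to-image, image-to-image, inpainting, and more. ControlNet supports many other kinds of conditioning inputs as well, which you can explore on your own.
Before you start, install the following dependencies: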
pip install -q diffusers transformers accelerate opencv-python
text-to-image
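Plain text-to-image generation only takes a text prompt. With ControlNet, we can add a further condition through an extra input image. Here we use a Canny edge condition as the example. Canny is an edge detector that traces the outlines in an image. Used as the conditioning image, it makes the generated result follow the text prompt while keeping the same edge outlines.
First, extract the Canny edge map with opencv: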
from diffusers.utils import load_image, make_image_grid
from PIL import Image
import cv2
import numpy as np
original_image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
image = np.array(original_image)
# Hysteresis thresholds for the Canny detector
low_threshold = 100
high_threshold = 200
image = cv2.Canny(image, low_threshold, high_threshold)
# Canny returns a single channel; repeat it to get a 3-channel image
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
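cv2.Canny returns a single-channel edge map of shape (H, W), which is why the snippet above adds a channel axis and repeats it three times before building the PIL image. A small sketch of that step on a synthetic array; the 4×4 `edges` array is made up for illustration and stands in for a real Canny result:

```python
import numpy as np

# Synthetic stand-in for a Canny result: single channel, uint8, values 0 or 255.
edges = np.zeros((4, 4), dtype=np.uint8)
edges[1, :] = 255  # one horizontal "edge"

# Same two steps as in the snippet above: add a channel axis, then repeat it 3 times.
stacked = edges[:, :, None]
stacked = np.concatenate([stacked, stacked, stacked], axis=2)

print(edges.shape)    # (4, 4)
print(stacked.shape)  # (4, 4, 3)

# All three channels carry the same edge map.
print(np.array_equal(stacked[:, :, 0], stacked[:, :, 2]))  # True
```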
Next, load a ControlNet model conditioned on Canny edges and pass it to StableDiffusionControlNetPipeline:
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler
import torch
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
Write a prompt and pass it to the pipeline together with the Canny conditioning image:
output = pipe(
    "the mona lisa", image=canny_image
).images[0]
make_image_grid([original_image, canny_image, output], rows=1, cols=3)

The generated image not only takes on the look of the Mona Lisa but also keeps the edge outlines of the original image.
image-to-image
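Plain image-to-image generation takes an initial image and a text prompt and produces a new image. With ControlNet, we can additionally pass a conditioning image to guide the model. This time we use a depth map as the guide. A depth map encodes the spatial depth of a picture, so with it as guidance the generated result preserves the spatial layout of the original image.
First, extract the depth map of the original image with the depth-estimation pipeline from transformers: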
import torch
import numpy as np
from transformers import pipeline
from diffusers.utils import load_image, make_image_grid
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg"
)
def get_depth_map(image, depth_estimator):
    # Run depth estimation; the "depth" entry is a single-channel PIL image
    image = depth_estimator(image)["depth"]
    image = np.array(image)
    # Repeat the single channel to get an (H, W, 3) array
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    # Scale to [0, 1] and move channels first: (H, W, 3) -> (3, H, W)
    detected_map = torch.from_numpy(image).float() / 255.0
    depth_map = detected_map.permute(2, 0, 1)
    return depth_map
depth_estimator = pipeline("depth-estimation")
# Add a batch dimension, cast to float16, and move to the GPU
depth_map = get_depth_map(image, depth_estimator).unsqueeze(0).half().to("cuda")
Next, load a ControlNet conditioned on depth maps and pass it to StableDiffusionControlNetImg2ImgPipeline:
from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
import torch
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
Then pass the text prompt, the input image, and the depth map to the pipeline to get the result:
from diffusers.utils import pt_to_pil
output = pipe(
    "lego batman and robin", image=image, control_image=depth_map,
).images[0]
make_image_grid([image, pt_to_pil(depth_map)[0], output], rows=1, cols=3)

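In the result, the figures match the text prompt, the outlines follow the original image, and the foreground and background respect the spatial information in the depth map.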
inpainting
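Inpainting normally takes an original image, a mask image, and a text prompt. With ControlNet we can specify an additional conditioning image. This time the inpainting mask serves as the condition, so that ControlNet generates content only inside the masked region.
First, load an original image and a mask image: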
from diffusers.utils import load_image, make_image_grid
init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg"
)
init_image = init_image.resize((512, 512))
mask_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg"
)
mask_image = mask_image.resize((512, 512))
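Next, write a function that builds the conditioning image from the original image and the mask. It marks every pixel of the original image whose mask value is above the 0.5 threshold as masked: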
import numpy as np
import torch
def make_inpaint_condition(image, image_mask):
    # Scale both images to [0, 1]
    image = np.array(image.convert("RGB")).astype(np.float32) / 255.0
    image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0
    # Image and mask must have the same height and width
    assert image.shape[0:2] == image_mask.shape[0:2]
    image[image_mask > 0.5] = 1.0  # set as masked pixel
    # Add a batch dimension and move channels first: (H, W, C) -> (1, C, H, W)
    image = np.expand_dims(image, 0).transpose(0, 3, 1, 2)
    image = torch.from_numpy(image)
    return image
control_image = make_inpaint_condition(init_image, mask_image)
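To make the effect of the function concrete, here is a NumPy-only sketch of its core masking step on a tiny 2×2 image. The arrays are invented for illustration, and the sketch skips the PIL conversion and the final torch.from_numpy call.

```python
import numpy as np

# Invented 2x2 RGB image with values already scaled to [0, 1].
image = np.full((2, 2, 3), 0.25, dtype=np.float32)

# Invented mask: only the top-right pixel is marked for repainting.
image_mask = np.array([[0.0, 1.0],
                       [0.0, 0.0]], dtype=np.float32)

# Core step of make_inpaint_condition: overwrite masked pixels with the marker value.
image[image_mask > 0.5] = 1.0

# Add a batch axis and move channels first: (H, W, C) -> (1, C, H, W).
condition = np.expand_dims(image, 0).transpose(0, 3, 1, 2)

print(condition.shape)        # (1, 3, 2, 2)
print(condition[0, :, 0, 1])  # masked pixel: every channel is 1.0
print(condition[0, :, 0, 0])  # untouched pixel: every channel is 0.25
```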
Next, create a ControlNet conditioned on inpainting and pass it to StableDiffusionControlNetInpaintPipeline:
from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler
controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
Finally, pass the text prompt, the original image, the mask image, and the conditioning image to the pipeline:
from diffusers.utils import pt_to_pil
output = pipe(
    "corgi face with large ears, detailed, pixar, animated, disney",
    num_inference_steps=20,
    eta=1.0,
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
).images[0]
make_image_grid([init_image, mask_image, pt_to_pil(control_image)[0], output], rows=1, cols=4)

As shown, the model repaints only the specified region, and the new content matches the meaning of the text prompt.
Guess Mode
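In guess mode there is no need to give the ControlNet a prompt at all; the prompt can simply be left empty. This forces the ControlNet to do its best to guess what the conditioning input is (a depth map, a human pose, Canny edges, and so on).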
Guess mode adjusts the scale of the output residuals from a ControlNet by a fixed ratio depending on the block depth. The shallowest DownBlock corresponds to 0.1, and as the blocks get deeper, the scale increases exponentially such that the scale of the MidBlock output becomes 1.0.
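The scaling schedule described above can be sketched in plain Python as a log-spaced sequence from 0.1 to 1.0. The sketch assumes 12 down-block residuals plus one mid-block residual, which should match a Stable Diffusion 1.5 ControlNet; that count and the helper name `guess_mode_scales` are assumptions for illustration, not library code.

```python
def guess_mode_scales(num_down_residuals):
    """Return log-spaced scales from 0.1 (shallowest block) to 1.0 (mid block)."""
    steps = num_down_residuals  # number of intervals between the first and last scale
    return [10 ** (-1 + i / steps) for i in range(steps + 1)]


# Assumption: 12 down-block residuals, as in a Stable Diffusion 1.5 ControlNet.
scales = guess_mode_scales(12)

print(len(scales))           # 13: 12 down-block scales + 1 mid-block scale
print(round(scales[0], 3))   # 0.1
print(round(scales[-1], 3))  # 1.0
```

Each scale is a constant factor larger than the previous one, so shallow residuals are damped strongly and only the deepest ones pass through at full strength.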
Set guess_mode to True in the pipeline; a guidance_scale between 3 and 5 is recommended.
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image, make_image_grid
import numpy as np
import torch
from PIL import Image
import cv2
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", use_safetensors=True)
pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True).to("cuda")
original_image = load_image("https://huggingface.co/takuma104/controlnet_dev/resolve/main/bird_512x512.png")
image = np.array(original_image)
low_threshold = 100
high_threshold = 200
image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0]
make_image_grid([original_image, canny_image, image], rows=1, cols=3)

ControlNet with Stable Diffusion XL
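So far, few ControlNets are available for SDXL, but the diffusers team has trained two ControlNet models for it, one conditioned on Canny edges and one conditioned on depth maps.
Here we use Canny edges as the example. First, load an image and generate its Canny edge map: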
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
from diffusers.utils import load_image, make_image_grid
from PIL import Image
import cv2
import numpy as np
import torch
original_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)
image = np.array(original_image)
low_threshold = 100
high_threshold = 200
image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
make_image_grid([original_image, canny_image], rows=1, cols=2)
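Next, load the SDXL Canny ControlNet model and pass it to StableDiffusionXLControlNetPipeline:
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, vae=vae, torch_dtype=torch.float16, use_safetensors=True
)
pipe.enable_model_cpu_offload()
Then, as before, pass the prompt and the Canny conditioning image to the pipeline to generate an image: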
prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = 'low quality, bad quality, sketches'
image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    image=canny_image,
    controlnet_conditioning_scale=0.5,
).images[0]
make_image_grid([original_image, canny_image, image], rows=1, cols=3)

The SDXL ControlNet pipeline also supports guess mode:
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL
from diffusers.utils import load_image, make_image_grid
import numpy as np
import torch
import cv2
from PIL import Image
prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting"
negative_prompt = "low quality, bad quality, sketches"
original_image = load_image(
    "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png"
)
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, vae=vae, torch_dtype=torch.float16, use_safetensors=True
)
pipe.enable_model_cpu_offload()
image = np.array(original_image)
image = cv2.Canny(image, 100, 200)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
image = pipe(
    prompt, negative_prompt=negative_prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True,
).images[0]
make_image_grid([original_image, canny_image, image], rows=1, cols=3)
MultiControlNet
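Multiple ControlNet conditions can be combined into a MultiControlNet. The following tips generally help to get better results:
- Mask the conditions so that they do not overlap (for example, leave out the Canny edge information wherever the human pose map is placed)
- You may need to tune the controlnet_conditioning_scale parameter.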
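The example below uses SDXL; to use MultiControlNet with Stable Diffusion 1.5 instead, replace the SDXL pipeline and ControlNet checkpoints with their SD 1.5 counterparts.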
In this example, we combine a Canny edge map and a human pose map to generate an image.
First, prepare the Canny edge map of an image:
from diffusers.utils import load_image, make_image_grid
from PIL import Image
import numpy as np
import cv2
original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png"
)
image = np.array(original_image)
low_threshold = 100
high_threshold = 200
image = cv2.Canny(image, low_threshold, high_threshold)
# The human pose condition will be placed in the middle of the image later, so mask that region out of the Canny map first (tip 1)
zero_start = image.shape[1] // 4
zero_end = zero_start + image.shape[1] // 2
image[:, zero_start:zero_end] = 0
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
make_image_grid([original_image, canny_image], rows=1, cols=2)
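The masking step above follows the first tip. Whether two condition maps actually overlap can be checked with a few lines of NumPy. This is a sketch on synthetic arrays: `canny_map` and `pose_map` are made-up stand-ins for the real Canny and pose images, and `overlap_fraction` is a hypothetical helper, not part of diffusers.

```python
import numpy as np


def overlap_fraction(map_a, map_b):
    """Fraction of pixels that are non-zero in both condition maps."""
    both = (map_a > 0) & (map_b > 0)
    return float(both.mean())


height, width = 8, 8

# Made-up Canny map with edges everywhere, then blanked in the middle half of the columns.
canny_map = np.full((height, width), 255, dtype=np.uint8)
zero_start = width // 4
zero_end = zero_start + width // 2
canny_map[:, zero_start:zero_end] = 0

# Made-up pose map that only occupies the columns blanked above.
pose_map = np.zeros((height, width), dtype=np.uint8)
pose_map[:, zero_start:zero_end] = 255

print(overlap_fraction(canny_map, pose_map))  # 0.0, the two conditions do not overlap
```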

Now prepare the human pose map. First install the controlnet_aux package:
# !pip install -q controlnet-aux
from controlnet_aux import OpenposeDetector
openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
original_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png"
)
openpose_image = openpose(original_image)
make_image_grid([original_image, openpose_image], rows=1, cols=2)

Load the two ControlNets, one for human pose and one for Canny edges, put them in a list, and pass the list to the pipeline:
from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL, UniPCMultistepScheduler
import torch
controlnets = [
    ControlNetModel.from_pretrained(
        "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16
    ),
    ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True
    ),
]
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, vae=vae, torch_dtype=torch.float16, use_safetensors=True
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
Then pass the text prompt, the Canny edge map, and the human pose map to the pipeline to generate images:
prompt = "a giant standing in a fantasy landscape, best quality"
negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality"
generator = torch.manual_seed(1)
images = [openpose_image.resize((1024, 1024)), canny_image.resize((1024, 1024))]
images = pipe(
    prompt,
    image=images,
    num_inference_steps=25,
    generator=generator,
    negative_prompt=negative_prompt,
    num_images_per_prompt=3,
    controlnet_conditioning_scale=[1.0, 0.8],
).images
make_image_grid([original_image, canny_image, openpose_image,
                 images[0].resize((512, 512)), images[1].resize((512, 512)), images[2].resize((512, 512))], rows=2, cols=3)
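How the per-ControlNet values in controlnet_conditioning_scale weigh the two conditions can be pictured with a toy sketch. It assumes that each ControlNet's residual is multiplied by its own scale and that the scaled residuals are then summed before being added to the U-Net features. The numbers and the helper `combine_residuals` are invented for illustration.

```python
def combine_residuals(residuals, scales):
    """Scale each ControlNet's residual by its own weight and sum element-wise."""
    if len(residuals) != len(scales):
        raise ValueError("need exactly one scale per ControlNet residual")
    combined = [0.0] * len(residuals[0])
    for residual, scale in zip(residuals, scales):
        combined = [acc + scale * value for acc, value in zip(combined, residual)]
    return combined


# Invented residuals from two ControlNets (pose first, Canny second).
pose_residual = [1.0, 2.0, 3.0]
canny_residual = [10.0, 10.0, 10.0]

# Same weights as in the call above: full strength for pose, 0.8 for Canny.
print(combine_residuals([pose_residual, canny_residual], [1.0, 0.8]))  # [9.0, 10.0, 11.0]
```

Lowering one entry of the list weakens only that condition, which is why the scales are worth tuning when one condition dominates the result.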
