January 13, 2025
Welcome to Lesson 12 of 12 in our free course series, LLM Twin: Building Your…
OpenAI’s CLIP model and related techniques have taken the field of machine learning by storm since the group released their first blog post about the model in January 2021. I highly recommend that original post as an introduction to the big ideas of CLIP, if you haven’t read it already—but CLIP’s architecture and training process are outside the scope of this tutorial.
We’re going to focus on how to use the clip
Python library to make similarity comparisons between text and images with pre-trained CLIP models. This basic functionality is at the heart of popular and highly sophisticated CLIP-based techniques. Notably, the model has been very popular in the AI art / generative images sphere, thanks to techniques originally made popular by artist and researcher Ryan Murdock. While the official CLIP repository provides a helpful Colab tutorial for “Interacting with CLIP,” this tutorial will start moving us toward using CLIP as a visual reasoning engine for generative work.
We’re going to use CLIP to compare individual solid colors to text prompts, and then see how our similarity score changes as we interpolate from one color to another.
In the cover image for this blog post, for instance, I interpolated from a yellow (RGB=(1,1,0)) to a pink (RGB = (1,0,1)), and at each step, had CLIP report the similarity between that color and the phrase “A glass of lemonade.”
Higher scores denote greater similarity, so the bar chart tells us that the bright yellow is the best match for the text prompt. However, we see that as we fade from yellow to pink, some of the pink tones start to match “lemonade” very well, as well. It seems that the more bluish pinks closer to the right side of the plot start to get a little less similar to our text prompt.
We could imagine applying this idea toward optimizing a color. We provide a seed color—maybe a gray or a randomly generated RGB code—and we let CLIP guide us in adjusting that color to match it to a prompt as well as possible. We won’t do that in this tutorial, though. You’ll have to stay turned for part two of this series!
Prompt engineering plus Comet plus Gradio? What comes out is amazing AI-generated art! Take a closer look at our public logging project to see some of the amazing creations that have come out of this fun experiment.
Let’s go over installing CLIP first. The code shown below is intended for an IPython notebook, i.e. Colab. To run it in a bash shell, just remove the !
at the beginning of each line.
!pip install --quiet ftfy regex tqdm
!pip install --quiet git+https://github.com/openai/CLIP.git
First, we need to load a pre-trained CLIP model. The clip
library makes a number of models available, each corresponding to a slightly different architecture. Generally, there are two categories we’re dealing with: those models based on ResNet
architectures and those based on VisualTransformer
architectures.
# Should be one of ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'ViT-B/32', 'ViT-B/16'] model_name = 'ViT-B/16' model, preprocess = clip.load(model_name)# Set to "evaluation mode" model.eval()
To encode text using a pre-trained CLIP model, there are a few things we need to do. The first is to tokenize the text as follows:
text = 'some text to encode'
tokenized_text = clip.tokenize(text)
Once the text is tokenized, it can be encoded using the pre-trained model’s text transformer.
encoded_text = model.encode_text(tokenized_text)
PyTorch adheres to a convention that may be unfamiliar if you’re used to working with images with PIL, NumPy, OpenCV, or TensorFlow. Say we have an RGB color image that is 640 pixels wide by 480 pixels tall. While PIL, NumPy, etc. would treat this as an array of shape (480, 640, 3)
, in PyTorch it would be (3, 480, 640)
.
This means that, if you load images using PIL, for instance, you must rearrange the dimensions to the proper order. This can be accomplished using the .permute()
method for torch tensors. For example:
img = Image.open('/path/to/image.png')
arr = np.array(img)
# Rearrange the dimensions from (HWC) to (CHW)
x = torch.tensor(arr).permute((2,0,1))
But that’s not the whole story! If we had a stack of 10 images of this size, we would store them in a tensor of shape (10, 3, 480, 640)
. In other words, PyTorch formats images as (N, Channel, Height, Width) or NCHW
. Moreover, many models are designed to work with batches of images, and you may need to convert a tensor of shape (C,H,W)
to (1,C,H,W)
. This can be accomplished using the .unsqueeze()
method as follows:
# Old x.shape = (C, H, W), new x.shape will be (1, C, H, W) # because .unsqueeze is expanding dimension 0 x = x.unsqueeze(0)# Similarly, we can do (?) x = x.squeeze(0)
With CLIP, our goal is to make multi-modal comparisons — more specifically, we want to measure similarity across images and text. CLIP learns its image and text encoder models together, and we can access the image encoder via the .encode_image
method of a trained CLIP model:
img = Image.open('/path/to/image.png')# We can just use the model's associated preprocesser function x = model.preprocess(img)encoded_image = model.encode_image(x)
Of course, for our example in this tutorial, we’re not working with existing images. We’re generating our own!
Let’s write a function that turns a color in floating-point RGB space into a properly formatted image tensor. In other words, we may have something like color = (1, 0, 0)
for red or color = (0.5, 0.5, 0.5)
for a 50% gray. Really, this comes down to instantiating a tensor object and reshaping it to NCHW
format. This function returns a 1×1 pixel image with RGB channels. More aptly, it outputs a stack containing a single image.
def create_rgb_tensor(color): """color is e.g. [1,0,0]""" return torch.tensor(color, device=DEVICE).reshape((1, 3, 1, 1))red_tensor = create_rgb_tensor((1, 0, 0))
But we can’t pass one of these into our visual encoder just yet. Recall that the CLIP model has a predefined resolution, which can be found with model.visual.input_resolution
.
For us, this input resolution is a height and width of (224,224)
. So let’s resize our RGB image. There are any number of ways we could do this using torch and torch-compatible libraries, but I’ll use the Resize
class from torchvision.transforms
.
from torchvision.transforms import Resize# Create the Resizer resolution = model.visual.input_resolution resizer = torchvision.transforms.Resize(size=(resolution, resolution))resized_tensor = resizer(red_tensor)torch.encode_image(resized_tensor)
We now have encoded text and encoded images. How do we measure the similarity between them? We could employ any number of possible similarity metrics, but we’ll take cosine similarity as a starting point.
similarity = torch.cosine_similarity(encoded_text, encoded_image)
We’re almost there…We can encode text. We can generate images from colors. We can resize those colors to the proper size for CLIP. And we can encode those images using CLIP.
Now, let’s write a function to interpolate between two colors. We’ll stick with linear interpolation, which simply “draws a line” between two endpoints. Note that the function below is called lerp
which stands for (L)inear int(ERP)olation. This abbreviation is a strong convention, in my experience, and you’re likely to see it when and where you see a linear interpolation function defined in code.
def lerp(x, y, steps=11): """Linear interpolation between two tensors """ weights = np.linspace(0, 1, steps) weights = torch.tensor(weights, device=DEVICE) weights = weights.reshape([-1, 1, 1, 1]) interpolated = x * (1 - weights) + y * weights return interpolatedblue_tensor = create_rgb_tensor((0,0,1)) color_range = lerp(red_tensor, blue_tensor, 11)color_range = resizer(color_range)similarities = torch.cosine_similarity(encoded_text, color_range)
There are a few more functions we need for turning these colors into a nice bar plot in Pandas, but I won’t describe them all in detail here. Here’s a code snippet detailing this process:
def rgb2hex(rgb): """Utility function for converting a floating point RGB tensor to a hexadecimal color code.""" rgb = (rgb * 255).astype(int) r,g,b = rgb return "#{:02x}{:02x}{:02x}".format(r,g,b)def get_interpolated_scores(x, y, encoded_text, steps=11): interpolated = lerp(x, y, steps) interpolated_encodings = model.encode_image(resizer(interpolated)) scores = torch.cosine_similarity(interpolated_encodings, encoded_text) scores = sc.detach().cpu().numpy() rgb = interpolated.detach().cpu().numpy().reshape(-1, 3) interpolated_hex = [rgb2hex(x) for x in rgb] data = pd.DataFrame({ 'similarity': scores, 'color': interpolated_hex }).reset_index().rename(columns={'index':'step'}) return datadef similarity_plot(data, text_prompt): title = f'CLIP Cosine Similarity Prompt="{text_prompt}"' fig, ax = plt.subplots() plot = data['similarity'].plot(kind='bar', ax=ax, stacked=True, title=title, color=data['color'], width=1.0, xlim=(0, 2), grid=False) plot.get_xaxis().set_visible(False) ; return fig
We can get this process working in a Colab notebook, but let’s talk about deploying this model as an interactive app. I’ll be using Gradio as a framework and Hugging Face Spaces to deploy.
I won’t go into great detail about writing a Gradio app, but here are the basics:
gradio_fn
. Note that this is just a Python function, and doesn’t actually require any Gradio specific stuff yet. This is simply what Gradio will do with the inputs we provide.gradio.inputs
module. In this example, we’re using Textbox
inputs to write RGB values and a Slider
to select the number of steps. Gradio’s simplicity is a big part of its appeal — the apps are not only simple to set up, but also have very uniform interfaces. There are other things that Gradio does an amazing job of handling for us, such as spinning up temporary, publicly shareable apps from a single line of code in a Colab notebook with no signup necessary! If we wanted to create something more custom, though—for instance if we wanted to use a color picker to select our RGB values—we might look at a framework like Streamlit. Streamlit is also supported by Hugging Face Spaces, and is overall excellent, but I went with Gradio this time, especially because it’s easy to test on Colab.gradio.Interface
object, which takes as arguments our function, inputs, and an output type. Then we simply call the .launch()
method! As I mentioned before, what’s really amazing is that this method will work from a Colab cell, spitting out a public link and opening an inline IFrame with our app. But it will also work in our .py
file on Hugging Face Spaces.def gradio_fn(rgb_start, rgb_end, text_prompt, steps=11, grad_disabled=True): rgb_start = [float(x.strip()) for x in rgb_start.split(',')] rgb_end = [float(x.strip()) for x in rgb_end.split(',')] start = create_rgb_tensor(rgb_start) end = create_rgb_tensor(rgb_end) encoded_text = encode_text(text_prompt) data = get_interpolated_scores(start, end, encoded_text, steps) return similarity_plot(data, text_prompt)gradio_inputs = [gr.inputs.Textbox(lines=1, default="1, 0, 0", label="Start RGB"), gr.inputs.Textbox(lines=1, default="0, 1, 0", label="End RGB"), gr.inputs.Textbox(lines=1, label="Text Prompt", default='A solid red square'), gr.inputs.Slider(minimum=1, maximum=30, step=1, default=11, label="Interpolation Steps")]iface = gr.Interface(fn=gradio_fn, inputs=gradio_inputs, outputs="plot") iface.launch()
What else can we do now that we can compare images and text using CLIP? The list is long, believe me, but here are some broad ideas:
Next time, we’ll look at using CLIP to guide the training of a PyTorch model to directly generate a color to best match a prompt. We’ll also start comparing aspects of the different pre-trained CLIP models in terms of performance and efficiency, and we’ll look at how to track the parameters and outputs of runs with Comet.