Prototyping an Image-Text Similarity Calculator with Cohere Embed 4

Tadashi Shigeoka ·  Sat, September 13, 2025

Have you ever thought about “searching for text in design documents saved as images” or “linking screenshots with related documents”?

Embed 4, recently announced by Cohere, is a cutting-edge multimodal embedding model that handles not only text but also images. Using this, we can calculate semantic similarity between images and text, potentially solving the challenges mentioned above.

So I built a prototype in Node.js, the Image-Text Similarity Calculator, which uses Cohere Embed 4 to score how similar an image and a piece of text are. This article covers its overview, usage, and implementation highlights, with sample code.

🚀 What is Cohere Embed 4?

Cohere Embed 4 is a powerful multimodal model that can embed both text and images in a common vector space. This makes it possible to directly compare “image content” and “text queries,” which was previously difficult.

Various applications are expected, including business document search, e-commerce product image search, and blueprint management.
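To make "comparing an image and a text query in a common vector space" concrete, here is a minimal sketch of cosine similarity, the measure this tool reports. The function name and the toy three-dimensional vectors are assumptions for illustration; real embeddings have far more dimensions, and this is not the repository's code.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// The result lies in [-1, 1]; values closer to 1 mean the vectors point in a more similar direction.
function cosineSimilarity(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine similarity is undefined for a zero vector');
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up vectors standing in for an image embedding and a text embedding.
const imageVector = [0.2, 0.7, 0.1];
const textVector = [0.25, 0.65, 0.05];
console.log(cosineSimilarity(imageVector, textVector).toFixed(6));
```

Because both embeddings live in the same space, the same function works whether the inputs came from an image or from text.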

🎯 The Tool: Image-Text Similarity Calculator

What I created is a simple Node.js tool that takes an image file and a text query related to its content, then calculates the similarity between them.

🔗 Add Cohere Embed v4.0 image-text similarity calculator · Pull Request #17 · codenote-net/ai-llm-sandbox

For example, you can perform calculations like:

  • Similarity between an image of an e-commerce (EC) site requirements document and the text “EC requirements” → 0.82
  • Similarity between a system architecture diagram image and the text “system architecture” → 0.76
  • Similarity between a Japanese document image and Japanese queries → 0.68

In this way, the model captures what the image contains and quantifies its relevance to the text.

📁 Project Setup

The tool is publicly available on GitHub. You can easily set it up with the following steps.

1. Check Node.js Version

This project assumes Node.js v24.8.0. If the command below reports a different version, install v24.8.0 from the official Node.js site.

node --version

2. Install Dependencies

npm install

3. Configure Environment Variables

You need to set up your Cohere API key.

cp .env.example .env

Next, get your API key from the Cohere dashboard and set it in the created .env file.

# .env
COHERE_API_KEY="your_cohere_api_key_here"

That’s it for setup!
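If you want the tool to fail early when the key is missing, a small guard along these lines can help. This is a sketch under assumptions: `requireCohereApiKey` is a hypothetical helper name, and it presumes the `.env` values have already been loaded into `process.env` by whatever mechanism the project uses (not shown in this article).

```javascript
// Reads the Cohere API key from an environment-like object and fails fast if it is unusable.
// `requireCohereApiKey` is a hypothetical name for illustration, not taken from the repository.
function requireCohereApiKey(env = process.env) {
  const key = (env.COHERE_API_KEY ?? '').trim();
  // Reject both a missing key and the untouched placeholder from .env.example.
  if (key === '' || key === 'your_cohere_api_key_here') {
    throw new Error('COHERE_API_KEY is not set. Copy .env.example to .env and add your key.');
  }
  return key;
}
```

Calling `requireCohereApiKey()` once at startup turns a confusing authentication error later on into an immediate, readable message.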

💻 How to Use the Tool

The usage is very simple.

Basic Similarity Calculation

Run with the path to an image file and the text query you want to compare as arguments.

# Calculate with one query
npm run similarity ./images/samples/ec-requirements.png "EC requirements"

When executed, you’ll see results like this in the console:

🚀 Starting image-text similarity calculation...
 
Image: ./images/samples/ec-requirements.png
Query: "EC requirements"
Calculating...
 
 Calculation complete!
==================================================
Similarity Score: 0.823456
Cosine Similarity (Range 0-1, closer to 1 means more similar)
==================================================
Interpretation: 🔥 Very high relevance

It displays not only the score but also an intuitive interpretation like 🔥 Very high relevance.
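A score-to-label mapping like this can be a simple threshold table. The sketch below is an assumption: only the "🔥 Very high relevance" label appears in the output above, and the cut-off values and the other labels are hypothetical, so the actual tool may draw the lines differently.

```javascript
// Maps a similarity score to a human-readable label.
// The thresholds (0.8, 0.6, 0.4, 0.2) and all labels except the first are hypothetical.
function interpretSimilarity(score) {
  if (score >= 0.8) return '🔥 Very high relevance';
  if (score >= 0.6) return 'High relevance';
  if (score >= 0.4) return 'Moderate relevance';
  if (score >= 0.2) return 'Low relevance';
  return 'Little or no relevance';
}

console.log(`Interpretation: ${interpretSimilarity(0.823456)}`);
```

Useful thresholds depend on the model and the data, so it is worth calibrating them against a handful of known matching and non-matching pairs.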

Batch Calculation with Multiple Queries

You can also calculate similarities with multiple text queries for a single image at once.

npm run similarity ./images/samples/ec-requirements.png "EC requirements" "E-commerce system specifications" "Requirements document" "Cooking recipe"

In this case, the results are displayed as a ranking sorted by similarity.

Similarity Ranking:
1. 🥇 "EC requirements": 0.823456
2. 🥈 "E-commerce system specifications": 0.687234
3. 🥉 "Requirements document": 0.654321
4. 📝 "Cooking recipe": 0.123456
 
Statistics:
  Average Similarity: 0.572117
  Max Similarity: 0.823456
  Min Similarity: 0.123456

You can see that the unrelated query “cooking recipe” scores low, indicating that meanings are correctly captured.
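The ranking and statistics can be reproduced from the individual scores with a sort and a few reductions. This is a minimal sketch that reuses the scores shown above; the function names and the `{ query, similarity }` shape are assumptions rather than the tool's actual data structures.

```javascript
// Returns a new array sorted from most to least similar (the input is not mutated).
function rankBySimilarity(results) {
  return [...results].sort((a, b) => b.similarity - a.similarity);
}

// Computes the average, maximum, and minimum similarity over all results.
function summarize(results) {
  if (results.length === 0) {
    throw new Error('At least one result is required');
  }
  const scores = results.map((r) => r.similarity);
  const sum = scores.reduce((acc, s) => acc + s, 0);
  return {
    average: sum / scores.length,
    max: Math.max(...scores),
    min: Math.min(...scores),
  };
}

// Scores copied from the sample output above, deliberately left unsorted.
const results = [
  { query: 'Requirements document', similarity: 0.654321 },
  { query: 'Cooking recipe', similarity: 0.123456 },
  { query: 'EC requirements', similarity: 0.823456 },
  { query: 'E-commerce system specifications', similarity: 0.687234 },
];

rankBySimilarity(results).forEach((r, i) => {
  console.log(`${i + 1}. "${r.query}": ${r.similarity.toFixed(6)}`);
});
const stats = summarize(results);
console.log(`Average Similarity: ${stats.average.toFixed(6)}`);
```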

Detailed Analysis and Report Output

For more detailed analysis, run the following command:

npm run validate

This command calculates similarities based on predefined test cases (combinations of multiple images and queries) and outputs the results to the reports/ directory in JSON and text formats.
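As a rough idea of what producing both formats involves, the sketch below turns a list of results into a JSON string and a plain-text string. The field names (`image`, `query`, `similarity`, `generatedAt`) and the text layout are hypothetical; the real report format is defined in the repository.

```javascript
// Builds the contents of a JSON report and a plain-text report from the same results.
// Field names and layout are hypothetical, chosen only for illustration.
function buildReports(results, generatedAt = new Date()) {
  const payload = { generatedAt: generatedAt.toISOString(), results };
  const json = JSON.stringify(payload, null, 2);
  const lines = results.map(
    (r) => `${r.image}\t"${r.query}"\t${r.similarity.toFixed(6)}`
  );
  const text = [`Generated at: ${payload.generatedAt}`, ...lines].join('\n');
  return { json, text };
}

const reports = buildReports([
  { image: './images/samples/ec-requirements.png', query: 'EC requirements', similarity: 0.823456 },
]);
console.log(reports.text);
```

Writing the two strings into the reports/ directory is then a matter of two file writes with Node's `fs` module.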

🔍 Implementation Highlights

The core of the tool is split into four classes:

  • ImageProcessor: Converts images into the Base64 format that the Cohere API accepts and performs size checks.
  • CohereEmbedding: Calls the Cohere Embed v4 API to get embedding vectors for images and text.
  • SimilarityCalculator: Calculates cosine similarity based on the obtained vectors.
  • ReportGenerator: Formats calculation results and outputs report files.
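The image preparation step described for ImageProcessor can be sketched as follows: check the size, pick a MIME type from the file extension, and build a Base64 data URL. The function name, the list of supported types, and the 5 MB limit are assumptions for illustration; check Cohere's documentation for the actual limits.

```javascript
// File extensions mapped to MIME types (an assumed, non-exhaustive list).
const MIME_TYPES = new Map([
  ['png', 'image/png'],
  ['jpg', 'image/jpeg'],
  ['jpeg', 'image/jpeg'],
  ['webp', 'image/webp'],
  ['gif', 'image/gif'],
]);

// Assumed size limit for this sketch; the real API limit may differ.
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;

// Converts raw image bytes into a Base64 data URL after a size and type check.
function imageToDataUrl(fileName, bytes, maxBytes = MAX_IMAGE_BYTES) {
  const extension = fileName.toLowerCase().split('.').pop();
  const mimeType = MIME_TYPES.get(extension);
  if (!mimeType) {
    throw new Error(`Unsupported image type: ${fileName}`);
  }
  if (bytes.length > maxBytes) {
    throw new Error(`Image is ${bytes.length} bytes, over the ${maxBytes}-byte limit`);
  }
  return `data:${mimeType};base64,${Buffer.from(bytes).toString('base64')}`;
}

// Three arbitrary bytes stand in for real image data here.
console.log(imageToDataUrl('sample.png', Buffer.from([1, 2, 3])));
```

In a real script, `bytes` would come from reading the file, for example with `readFile` from `node:fs/promises`.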

The basic usage of SimilarityCalculator, which plays a central role, looks like this:

// Example using src/similarity-calculator.js
import { SimilarityCalculator } from './src/similarity-calculator.js';
 
async function main() {
  const calculator = new SimilarityCalculator();
  const result = await calculator.calculateSimilarity(
    './images/samples/ec-requirements.png',
    'EC requirements'
  );
 
  // Similarity: 0.823456
  console.log(`Similarity: ${result.similarity.toFixed(6)}`);
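}
 
// Run the example and report any failure (for example, a missing API key).
main().catch((error) => {
  console.error(error);
  process.exitCode = 1;
});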

You can see how simple it is to get the similarity between an image and text.
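Under the hood, a call like this has to send the image and the text to Cohere and read back one vector for each. The sketch below shows what those requests might look like with plain `fetch`. The endpoint, field names, and response shape reflect my understanding of Cohere's v2 Embed API and should be verified against the official API reference; the helper names are hypothetical, and the repository may use Cohere's SDK instead.

```javascript
// Assumed endpoint and model identifier; verify both against Cohere's API reference.
const EMBED_URL = 'https://api.cohere.com/v2/embed';
const MODEL = 'embed-v4.0';

// Request body for embedding a text query (field names are assumptions).
function buildTextRequest(query) {
  return {
    model: MODEL,
    input_type: 'search_query',
    embedding_types: ['float'],
    texts: [query],
  };
}

// Request body for embedding one image passed as a Base64 data URL (field names are assumptions).
function buildImageRequest(dataUrl) {
  return {
    model: MODEL,
    input_type: 'image',
    embedding_types: ['float'],
    images: [dataUrl],
  };
}

// Sends one request and returns the first float embedding from the response.
async function embed(body, apiKey) {
  const response = await fetch(EMBED_URL, {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body),
  });
  if (!response.ok) {
    throw new Error(`Embed request failed with status ${response.status}`);
  }
  const data = await response.json();
  // Assumed response shape: { embeddings: { float: [[...numbers]] } }
  return data.embeddings.float[0];
}
```

With the two vectors in hand, the similarity is the cosine of the angle between them.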

📈 Summary

This time, I implemented a prototype tool that calculates image-text similarity using Cohere’s new multimodal model, Embed 4.

Using this technology, various possibilities open up:

  • Advanced search systems for internal documents (finding related specifications from screenshots)
  • Retrieval-Augmented Generation (RAG) applications (providing image information as context to LLMs)
  • Automatic content classification (assigning tags based on image content)

The code introduced today is available in the GitHub repository. Please clone it and try it with your own images. You’ll surely experience the excitement of multimodal AI!

That’s all from the Gemba, where we implemented a prototype image-text similarity calculation tool using Cohere Embed 4.
