Prototyping an Image-Text Similarity Calculator with Cohere Embed 4
Have you ever thought about “searching for text in design documents saved as images” or “linking screenshots with related documents”?
Embed 4, recently announced by Cohere, is a cutting-edge multimodal embedding model that handles not only text but also images. Using this, we can calculate semantic similarity between images and text, potentially solving the challenges mentioned above.
So this time, I’ve implemented a prototype Image-Text Similarity Calculator tool that calculates the similarity between images and text using Cohere Embed 4 in Node.js. This article introduces its overview, usage, and implementation highlights along with sample code.
Cohere Embed 4 is a powerful multimodal model that can embed both text and images in a common vector space. This makes it possible to directly compare “image content” and “text queries,” which was previously difficult.
Various applications are expected, including business document search, e-commerce product image search, and blueprint management.
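To make the idea of comparing vectors in a common space concrete, here is a minimal sketch of cosine similarity, the measure this tool reports. The function name `cosineSimilarity` and the toy three-dimensional vectors are illustrative assumptions; real embeddings have far more dimensions, and the tool's own implementation may differ.

```javascript
// Cosine similarity between two embedding vectors of equal length.
// Illustrative sketch, not the tool's actual implementation.
function cosineSimilarity(a, b) {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error('Vectors must be non-empty and have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('Cosine similarity is undefined for a zero vector');
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for an image embedding and a text embedding.
const imageVector = [1, 2, 3];
const textVector = [2, 4, 6]; // same direction as imageVector
console.log(cosineSimilarity(imageVector, textVector)); // approximately 1
```

Because only the direction of the vectors matters, an image and a text query that point the same way in the embedding space score close to 1 regardless of vector length. Mathematically the value can range from -1 to 1.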
What I created is a simple Node.js tool that takes an image file and a text query related to its content, then calculates the similarity between them.
For example, you can perform calculations like:
In this way, the AI understands the content in the image and quantifies its relevance to text.
The tool is publicly available on GitHub. You can easily set it up with the following steps.
This project assumes Node.js v24.8.0. If you have a different version, please install it from the official site.
node --version
npm install
You need to set up your Cohere API key.
cp .env.example .env
Next, get your API key from the Cohere dashboard and set it in the created .env file.
# .env
COHERE_API_KEY="your_cohere_api_key_here"
That’s it for setup!
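If the key is missing, the first API call fails, so it can help to check for it at startup. The sketch below is an assumption about how such a check could look; `requireApiKey` is a hypothetical helper, and how the tool actually loads the .env file is not shown in this article.

```javascript
// Hypothetical helper: return the Cohere API key from an env-like object,
// or fail early with a message that points back to the setup steps.
function requireApiKey(env = process.env) {
  const key = env.COHERE_API_KEY;
  // Treat the placeholder from .env.example as "not set".
  if (!key || key === 'your_cohere_api_key_here') {
    throw new Error(
      'COHERE_API_KEY is not set. Copy .env.example to .env and add your key.'
    );
  }
  return key;
}
```

When running a script directly, Node.js can load the file with `node --env-file=.env`, after which `requireApiKey()` reads the key from `process.env`.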
The usage is very simple.
Run with the path to an image file and the text query you want to compare as arguments.
# Calculate with one query
npm run similarity ./images/samples/ec-requirements.png "EC requirements"
When executed, you’ll see results like this in the console:
🚀 Starting image-text similarity calculation...
Image: ./images/samples/ec-requirements.png
Query: "EC requirements"
Calculating...
✅ Calculation complete!
==================================================
Similarity Score: 0.823456
Cosine Similarity (Range 0-1, closer to 1 means more similar)
==================================================
Interpretation: 🔥 Very high relevance
It displays not only the score but also an intuitive interpretation like 🔥 Very high relevance.
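One way to turn a score into such a label is a small threshold table. The sketch below is only an illustration: the thresholds and every label except "🔥 Very high relevance" are assumptions, not values taken from the tool.

```javascript
// Hypothetical thresholds and labels, ordered from highest to lowest.
// Only "🔥 Very high relevance" appears in the sample output above.
const INTERPRETATIONS = [
  { min: 0.8, label: '🔥 Very high relevance' },
  { min: 0.6, label: 'High relevance' },
  { min: 0.4, label: 'Moderate relevance' },
];

// Return the label of the first threshold the score reaches.
function interpretScore(score) {
  const match = INTERPRETATIONS.find((entry) => score >= entry.min);
  return match ? match.label : 'Low relevance';
}

console.log(interpretScore(0.823456));
```

Keeping the thresholds in a data table rather than nested `if` statements makes them easy to tune once you have scores from your own images.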
You can also calculate similarities with multiple text queries for a single image at once.
npm run similarity ./images/samples/ec-requirements.png "EC requirements" "E-commerce system specifications" "Requirements document" "Cooking recipe"
In this case, results are displayed in ranking format sorted by similarity.
Similarity Ranking:
1. 🥇 "EC requirements": 0.823456
2. 🥈 "E-commerce system specifications": 0.687234
3. 🥉 "Requirements document": 0.654321
4. 📝 "Cooking recipe": 0.123456
Statistics:
Average Similarity: 0.572117
Max Similarity: 0.823456
Min Similarity: 0.123456
You can see that the unrelated query “cooking recipe” scores low, indicating that meanings are correctly captured.
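The ranking and statistics shown above come down to a sort and a few aggregates. Here is a minimal sketch using the scores from the sample output; the `{ query, similarity }` result shape and the function names are assumptions for illustration.

```javascript
// Sort results from most to least similar without mutating the input.
function rankResults(results) {
  return [...results].sort((a, b) => b.similarity - a.similarity);
}

// Compute average, max, and min similarity over a non-empty result list.
function summarize(results) {
  if (results.length === 0) {
    throw new Error('At least one result is required');
  }
  const scores = results.map((r) => r.similarity);
  const sum = scores.reduce((acc, s) => acc + s, 0);
  return {
    average: sum / scores.length,
    max: Math.max(...scores),
    min: Math.min(...scores),
  };
}

// Scores copied from the sample output, deliberately in unsorted order.
const sample = [
  { query: 'Cooking recipe', similarity: 0.123456 },
  { query: 'EC requirements', similarity: 0.823456 },
  { query: 'Requirements document', similarity: 0.654321 },
  { query: 'E-commerce system specifications', similarity: 0.687234 },
];

rankResults(sample).forEach((r, i) => {
  console.log(`${i + 1}. "${r.query}": ${r.similarity.toFixed(6)}`);
});
console.log(`Average Similarity: ${summarize(sample).average.toFixed(6)}`);
```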
For more detailed analysis, run the following command:
npm run validate
This command calculates similarities based on predefined test cases (combinations of multiple images and queries) and outputs the results to the reports/ directory in JSON and text formats.
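As a rough sketch of what producing the two report formats could involve, the functions below serialize a list of results as JSON and as plain text. The `{ image, query, similarity }` shape, the function names, and the line layout are assumptions; the actual report format of the tool may differ.

```javascript
// Hypothetical result shape: { image, query, similarity }.

// Machine-readable report: pretty-printed JSON with a result count.
function toJsonReport(results) {
  return JSON.stringify({ count: results.length, results }, null, 2);
}

// Human-readable report: one numbered line per image-query pair.
function toTextReport(results) {
  return results
    .map(
      (r, i) =>
        `${i + 1}. ${r.image} | "${r.query}" | ${r.similarity.toFixed(6)}`
    )
    .join('\n');
}

const results = [
  { image: './images/samples/ec-requirements.png', query: 'EC requirements', similarity: 0.823456 },
];
console.log(toTextReport(results));
```

Both functions return strings, so the caller decides where they go, for example by passing them to `writeFile` from `node:fs/promises` with a path under reports/.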
The heart of this tool is divided into several classes:
ImageProcessor: Converts images to the Base64 format accepted by the Cohere API and performs size checks.
CohereEmbedding: Communicates with the Cohere Embed v4 API to get embedding vectors for images and text.
SimilarityCalculator: Calculates cosine similarity based on the obtained vectors.
ReportGenerator: Formats calculation results and outputs report files.
The basic usage of SimilarityCalculator, which plays a central role, looks like this:
// Example using src/similarity-calculator.js
import { SimilarityCalculator } from './src/similarity-calculator.js';

async function main() {
  const calculator = new SimilarityCalculator();
  const result = await calculator.calculateSimilarity(
    './images/samples/ec-requirements.png',
    'EC requirements'
  );
  // Similarity: 0.823456
  console.log(`Similarity: ${result.similarity.toFixed(6)}`);
}

// Run the example and print any error (for instance, a missing API key).
main().catch(console.error);
You can see how simple it is to get the similarity between an image and text.
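Behind that single call, the image first has to be converted to Base64 and size-checked, which is the job of ImageProcessor described above. The following is a minimal sketch of that step, not the tool's actual code: `toDataUri`, the list of extensions, and the 5 MB limit are assumptions, and the data URI form is assumed here as the way the image is passed to the API.

```javascript
// Assumed size limit for illustration; check the Cohere documentation
// for the limit that actually applies to Embed 4.
const MAX_IMAGE_BYTES = 5 * 1024 * 1024;

// Hypothetical mapping from file extension to MIME type.
const MIME_TYPES = new Map([
  ['png', 'image/png'],
  ['jpg', 'image/jpeg'],
  ['jpeg', 'image/jpeg'],
  ['webp', 'image/webp'],
  ['gif', 'image/gif'],
]);

// Convert raw image bytes into a Base64 data URI after basic checks.
function toDataUri(buffer, extension) {
  const mimeType = MIME_TYPES.get(extension.toLowerCase());
  if (!mimeType) {
    throw new Error(`Unsupported image extension: ${extension}`);
  }
  if (buffer.length > MAX_IMAGE_BYTES) {
    throw new Error(`Image is too large: ${buffer.length} bytes`);
  }
  return `data:${mimeType};base64,${buffer.toString('base64')}`;
}

// Two placeholder bytes stand in for real image data.
console.log(toDataUri(Buffer.from('hi'), 'png'));
```

In practice the buffer would come from reading the image file, for example with `readFile` from `node:fs/promises`, and the extension from the file path.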
This time, I implemented a prototype tool that calculates image-text similarity using Cohere’s new multimodal model, Embed 4.
Using this technology, various possibilities open up:
The code introduced today is available in the GitHub repository. Please clone it and try it with your own images. You’ll surely experience the excitement of multimodal AI!
That’s all from the Gemba, where we implemented a prototype image-text similarity calculation tool using Cohere Embed 4.