At Curalate, we’re helping the world’s greatest brands manage and monetize their digital imagery. Whether it’s helping our clients leverage user-generated content in new and interesting ways or analyzing the billions of images being shared daily across social media for micro and macro trends, we’re solving tough problems at the intersection of big data and computer vision.
Building Curalate’s Analytics Platform
We have five core product values that we use as a litmus test for everything we build. One of them is “intelligence” — we believe that every product or feature we release should feel a little smarter than our users would expect; put another way, there should be just a trace of magic in everything we ship.
To get this kind of intelligence, we occasionally have to hit the books (or academic papers). As an example, in building our analytics platform, we realized that we needed a foolproof way to de-duplicate similar images. When we say “similar” images, we’re talking about images that may have been cropped, scaled, color corrected, etc., but that would clearly look identical to the human eye. This problem is trivial at small scale: you use some Fourier transforms to convert each image into a perceptual/fuzzy hash, and then you look for nearby hashes using any sort of search algorithm. Unfortunately, the trivial approach compares every hash against every other, which takes O(n^2) running time. That is a problem given that we’re processing well north of 200 million images/day.
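To make the small-scale approach concrete, here is a minimal sketch of it in Python. It is an illustration under assumptions, not our production code: it swaps the Fourier-based hash for a much simpler "average hash" over an 8x8 grayscale grid, and the function names and the distance threshold of 4 bits are hypothetical choices made for the example.

```python
from itertools import combinations


def average_hash(pixels):
    """Turn an 8x8 grayscale grid into a 64-bit integer.

    Assumption: a simple average hash stands in for the Fourier-based
    perceptual hash. Each bit is 1 when the pixel is above the mean.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def find_duplicates_bruteforce(hashes, max_distance=4):
    """Compare every pair of hashes: n * (n - 1) / 2 comparisons, i.e. O(n^2).

    hashes: dict mapping an image name to its integer hash.
    max_distance: hypothetical threshold for calling two images duplicates.
    """
    return [
        (name_a, name_b)
        for (name_a, hash_a), (name_b, hash_b) in combinations(hashes.items(), 2)
        if hamming(hash_a, hash_b) <= max_distance
    ]


# A gradient image, a brightened copy of it, and an inverted copy.
original = [[row * 8 + col for col in range(8)] for row in range(8)]
brightened = [[2 * p + 10 for p in row] for row in original]
inverted = [[63 - p for p in row] for row in original]

hashes = {
    "original": average_hash(original),
    "brightened": average_hash(brightened),
    "inverted": average_hash(inverted),
}
duplicates = find_duplicates_bruteforce(hashes)
```

The brightened copy hashes to the same value as the original, so the pair is reported, while the inverted copy differs in every bit. The pairwise loop is the part that stops scaling: the number of comparisons grows with the square of the number of images.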
Our research engineering team went digging and found a 2012 paper that appeared to solve the running-time issue. The proposed solution, however, was still not ready for the volume of images we needed to throw at it: it relied on a single in-memory database, which wouldn’t scale to the billions of images we needed to handle. Our dev team re-conceived the algorithm on top of a distributed key-value store (specifically, DynamoDB) and was able to build a system that satisfied all of our requirements, ran in constant time, and had great performance characteristics.
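One general way to get near-duplicate lookups out of a key-value store is to split each hash into bands and use every band as a key. The sketch below illustrates that idea only; this post does not spell out the paper's algorithm or our DynamoDB schema, so the banding scheme, the class and function names, and the constants are all assumptions, and a plain in-memory dict stands in for the distributed store.

```python
from collections import defaultdict

# Hypothetical parameters: a 64-bit hash split into 4 bands of 16 bits.
BANDS = 4
BAND_BITS = 16
# If two hashes differ in at most 3 bits, at least one of the 4 bands
# must be identical (pigeonhole), so exact-match lookups find the pair.
MAX_DISTANCE = 3


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def band_keys(h):
    """Return one (band_index, band_value) key per band of the hash."""
    mask = (1 << BAND_BITS) - 1
    return [(i, (h >> (i * BAND_BITS)) & mask) for i in range(BANDS)]


class BandedIndex:
    """Near-duplicate index; the dict is a stand-in for a key-value table."""

    def __init__(self):
        self.store = defaultdict(set)

    def add(self, image_id, h):
        # One write per band, each under an exact-match key.
        for key in band_keys(h):
            self.store[key].add((image_id, h))

    def query(self, h):
        # A fixed number of key reads (BANDS), regardless of index size.
        candidates = set()
        for key in band_keys(h):
            candidates |= self.store.get(key, set())
        # Verify candidates with the full Hamming distance.
        return sorted(
            image_id
            for image_id, candidate in candidates
            if hamming(h, candidate) <= MAX_DISTANCE
        )


index = BandedIndex()
index.add("a", 0x0123456789ABCDEF)
index.add("b", 0xFFFFFFFF00000000)

near_a = index.query(0x0123456789ABCDEF ^ 0b101)  # two bits flipped
no_match = index.query(0)
```

In this sketch each query issues the same small number of key reads no matter how many images are indexed, which is the property that makes a key-value store a natural fit. The cost that remains is verifying the candidates returned under each key.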
Today this de-duplication service is a core part of our infrastructure, and its internal consumers are able to query a library of billions of known images for duplicates in under 20ms. Our clients never see this internal service, but it has powered some pretty magical end-user experiences that would otherwise never have been possible.