Curalate: Helping the World's Greatest Brands Monetize Their Digital Imagery

Featured post by Nick Shiftan
Co-Founder and CTO, Curalate

At Curalate, we’re helping the world’s greatest brands manage and monetize their digital imagery. Whether it’s helping our clients leverage user-generated content in new and interesting ways or analyzing the billions of images being shared daily across social media for micro and macro trends, we’re solving tough problems at the intersection of big data and computer vision.

Tech Stack

While we aren’t language zealots at Curalate — we believe in using the right tool for the job — today we do have a pretty cool tech stack. Our languages of choice are Scala and JavaScript (AngularJS). Our primary databases are Cassandra (~100 nodes), DynamoDB, MySQL, and Redshift. On the backend, we’ve invested heavily in Storm (real-time processing) and Finagle (micro-services). Finally, everything is hosted with AWS, and we use Asgard and a slew of other tools to manage deployments.

Building Curalate’s Analytics Platform

We have five core product values that we use as a litmus test for everything we build. One of them is “intelligence” — we believe that every product or feature we release should feel a little smarter than our users would expect; put another way, there should be just a trace of magic in everything we ship.

To get this kind of intelligence, we occasionally have to hit the books (or academic papers). As an example, in building our analytics platform, we realized that we needed a foolproof way to de-duplicate similar images. When we say “similar” images, we’re talking about images that may have been cropped, scaled, color corrected, etc., but that would clearly be identical to the human eye. This problem is trivial at small scale: you use some Fourier transforms to convert each image into a perceptual/fuzzy hash, and then you look for nearby hashes using any sort of search algorithm. Unfortunately, the trivial approach compares every hash against every other hash, which requires O(n^2) running time. That is a problem given that we’re processing well north of 200 million images/day.
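
To make the small-scale approach concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Curalate's code: it swaps in a simple average hash in place of the Fourier-based hash described above, it works on toy 8x8 grayscale grids rather than real images, and the function names and the 4-bit distance threshold are made up for the example.

```python
from itertools import combinations

def average_hash(pixels):
    """Hash an 8x8 grayscale grid to 64 bits: a bit is 1 where the pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")

def naive_duplicates(hashes, max_distance=4):
    """Compare every pair of hashes: n * (n - 1) / 2 comparisons, which is O(n^2)."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(hashes), 2)
        if hamming(a, b) <= max_distance
    ]

def pair_count(n):
    """How many comparisons the naive approach needs for n images."""
    return n * (n - 1) // 2

# Toy 8x8 "images": a gradient, a brightened copy of it, and a checkerboard.
gradient = [[r * 8 + c for c in range(8)] for r in range(8)]
brighter = [[2 * p for p in row] for row in gradient]
checker = [[255 if (r + c) % 2 else 0 for c in range(8)] for r in range(8)]

hashes = [average_hash(img) for img in (gradient, brighter, checker)]
duplicates = naive_duplicates(hashes)  # index pairs within max_distance bits
```

With these toy inputs, the gradient and its brightened copy hash identically and are reported as a pair, while the checkerboard is not. For a single day's 200 million images, `pair_count` comes out at roughly 2 x 10^16 comparisons, which is why the pairwise scan stops being practical.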

Our research engineering team hit the books and found a 2012 paper that appeared to solve the running-time problem. The proposed solution, however, was still not ready for the volume of images we needed to throw at it: it relied on a single in-memory database, which wouldn’t scale to the billions of images we needed it to handle. Our dev team re-conceived the algorithm on top of a distributed key-value store (specifically, DynamoDB) and built a system that satisfied all of our requirements, with a constant run time and great performance characteristics.
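
The post does not name the paper or describe the resulting design, so the following Python sketch is only one plausible way a near-duplicate lookup can be built on a key-value store. It assumes a chunk-based index: each 64-bit hash is split into four 16-bit chunks, and the hash is stored under one key per chunk. If two hashes differ in at most three bits, at least one of the four chunks must match exactly, so a query needs only four key lookups no matter how large the library is. A plain dictionary stands in for the distributed store, and every name here (`DuplicateIndex`, `chunk_keys`, the chunk sizes) is hypothetical.

```python
from collections import defaultdict

CHUNKS = 4        # assumption: split each 64-bit hash into 4 chunks
CHUNK_BITS = 16   # 4 * 16 = 64 bits
MASK = (1 << CHUNK_BITS) - 1

def chunk_keys(h):
    """One (chunk position, chunk value) key per chunk of the hash."""
    return [(i, (h >> (i * CHUNK_BITS)) & MASK) for i in range(CHUNKS)]

def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")

class DuplicateIndex:
    def __init__(self):
        # Stand-in for a distributed key-value store: key -> set of (id, hash).
        self.store = defaultdict(set)

    def add(self, image_id, h):
        """Write the hash under each of its chunk keys."""
        for key in chunk_keys(h):
            self.store[key].add((image_id, h))

    def query(self, h, max_distance=3):
        """Fetch CHUNKS keys, then verify candidates with the full distance."""
        if max_distance >= CHUNKS:
            raise ValueError("the exact-chunk guarantee needs max_distance < CHUNKS")
        candidates = set()
        for key in chunk_keys(h):
            candidates |= self.store.get(key, set())
        return sorted(i for i, other in candidates if hamming(h, other) <= max_distance)

index = DuplicateIndex()
index.add("a", 0x00000000FFFFFFFF)
index.add("b", 0x5555555555555555)

near_a = 0x00000000FFFFFFFF ^ 0b101   # differs from "a" in 2 bits
matches = index.query(near_a)
```

The number of store lookups per query is fixed at `CHUNKS`, which is the property that makes this kind of design a natural fit for a key-value store. The number of candidates to verify still depends on how the hashes are distributed.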

Today this de-duplication service is a core part of our infrastructure, and its internal consumers are able to query a library of billions of known images for duplicates in under 20ms. Our clients never see this internal service, but it has powered some pretty magical end-user experiences that would otherwise never have been possible.

Curalate, an Underdog.io customer, is hiring.