November 8, 2024

Streamlining client onboarding with eKYC

We built a complete eKYC pipeline: reading IDs and their machine-readable zone, matching faces with a custom autoencoder, and checking liveness with active and passive anti-spoofing.

Goals

Identity verification is one of the harder problems in regulated onboarding. A new customer has to be checked against their documents, confirmed to be a real person rather than a photo or a mask, and cleared in a compliant way, ideally in seconds and without a staff member having to touch every application.

In conversations with a European insurer, we saw what the manual version costs: their onboarding ran entirely by hand and took one to two days per applicant. For the system we built, one rule mattered most: make the call on its own when it can, and know when it can't. Anything uncertain gets flagged for a person to review rather than decided by a guess. Everything around that rule was designed with compliance in mind from the start.

Challenges

People photograph their IDs however they can: at an angle, in bad light, with glare bouncing off the laminate. Passports, national IDs and residence permits are all laid out differently. The machine-readable strip along the bottom follows a standard, which helps, but it's only useful if you read every character correctly.

Matching the face was harder. The photo on an ID might be five or ten years old, shot on a different camera in different light, and we're comparing it to a selfie taken on the spot. To a computer, those two images barely resemble each other. We needed the system to be confident it's the same person without being so suspicious that it turns away people who are exactly who they say they are.
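That tension is the usual trade-off between false accepts and false rejects. As a minimal sketch, assuming the matcher outputs a similarity score where higher means more alike, both error rates at a given threshold can be measured on labeled pairs: genuine pairs (same person) and impostor pairs (different people). The function names are hypothetical, and no scores or rates from the project are implied.

```python
def error_rates(genuine, impostor, threshold):
    """Return (FAR, FRR) for a threshold where score >= threshold means "same person".

    FAR: fraction of impostor pairs wrongly accepted.
    FRR: fraction of genuine pairs wrongly rejected.
    """
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def sweep(genuine, impostor, thresholds):
    """Tabulate (threshold, FAR, FRR) over a list of candidate thresholds."""
    return [(t, *error_rates(genuine, impostor, t)) for t in thresholds]
```

Raising the threshold lowers the false accept rate and raises the false reject rate. Sweeping it over held-out pairs is one way to pick an operating point, or two of them if the cases that fall between them go to a human.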

Then there's liveness, where it gets adversarial. A clear photo of the right face will sail through a face matcher, so a match on its own doesn't prove anyone is actually there. We had to confirm there was a real, living person in front of the camera, and that the check would hold up when someone tried to cheat with a printed photo, a video playing on a screen, or a mask. This part is never really finished. It's a back-and-forth with whoever's trying to get past it.

And the system had to know its own limits. In something this regulated, confidently approving the wrong person is far worse than admitting you're not sure. The obvious cases were easy. The real work was spotting the ones that weren't and sending those to a human, instead of forcing an answer.

Building the pipeline

From the applicant's side it's three steps: photograph your ID, take a selfie, and follow a short prompt to show you're really there. Underneath, the work happens in stages. The system reads the data off the document, checks the selfie against the photo printed on the ID, and confirms the person in front of the camera is real and present. Then it either returns a decision or sets the case aside for a human to review. The rest of this section goes through each stage and the choices behind it.
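The final routing step can be sketched as a small function over the stage outputs. Everything here is an assumption for illustration: the `StageResults` fields, the outcome labels, and both similarity thresholds are hypothetical, not the values the system used. The point is the shape of the logic: approve only when every stage is confident, and send anything in between to a person.

```python
from dataclasses import dataclass


@dataclass
class StageResults:
    mrz_checks_pass: bool   # every MRZ check digit matched
    face_similarity: float  # similarity between ID portrait and selfie (higher = more alike)
    liveness: str           # hypothetical labels: "live", "review", "reject" or "retry"


def decide(r: StageResults, match_accept: float = 0.80, match_reject: float = 0.55) -> str:
    """Route one application; the thresholds are illustrative placeholders."""
    if not r.mrz_checks_pass:
        return "retake_document"   # unreadable document: ask for a new photo
    if r.liveness == "retry":
        return "retry_liveness"    # the user can redo the prompt
    if r.liveness == "reject" or r.face_similarity < match_reject:
        return "reject"            # clear failure on either check
    if r.liveness == "live" and r.face_similarity >= match_accept:
        return "approve"           # every stage is confident
    return "review"                # anything in between goes to a human
```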

Reading the document

An ID's most reliable data lives in the machine-readable zone at the bottom: the two or three lines of monospaced characters that encode the document number, name, nationality, the relevant dates, and a few check digits. We read it with EasyOCR, after putting the image through OpenCV first to straighten it, fix the contrast, and crop down to just that strip. Phone photos need that cleanup before any OCR has a fair shot at them.
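A minimal sketch of the crop-and-contrast step, assuming the document has already been straightened so the MRZ sits in the bottom quarter of the image. It uses plain NumPy as a stand-in for the OpenCV calls (in OpenCV, `cv2.cvtColor` and a CLAHE object from `cv2.createCLAHE` would be the usual choices) and skips deskewing entirely. The function names and the 25% band are illustrative assumptions, not the pipeline's actual values.

```python
import numpy as np


def to_gray(img_rgb: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB image to float grayscale using BT.601 luma weights."""
    img = img_rgb.astype(np.float64)
    return img[..., 0] * 0.299 + img[..., 1] * 0.587 + img[..., 2] * 0.114


def crop_mrz_band(gray: np.ndarray, band: float = 0.25) -> np.ndarray:
    """Keep only the bottom `band` fraction of rows, where the MRZ is assumed to sit."""
    start = int(gray.shape[0] * (1.0 - band))
    return gray[start:, :]


def stretch_contrast(gray: np.ndarray, low_pct: float = 2, high_pct: float = 98) -> np.ndarray:
    """Map the [low_pct, high_pct] percentile range to [0, 255], clipping outliers such as glare."""
    lo, hi = np.percentile(gray, [low_pct, high_pct])
    if hi <= lo:  # flat image: nothing to stretch
        return np.zeros(gray.shape, dtype=np.uint8)
    scaled = (gray - lo) / (hi - lo) * 255.0
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)


def prepare_mrz(img_rgb: np.ndarray, band: float = 0.25) -> np.ndarray:
    """Grayscale, crop to the MRZ band, then stretch contrast before handing off to OCR."""
    return stretch_contrast(crop_mrz_band(to_gray(img_rgb), band))
```

The result is a `uint8` array that EasyOCR's `Reader.readtext` accepts directly. Passing `allowlist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789<"` keeps the recognizer from proposing characters that can never appear in an MRZ.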

The check digits earn their keep here. The MRZ carries its own checksums, so once we'd read it we could verify the result against them instead of hoping every character came back right. When a check digit didn't match, we took that as a sign of a bad photo and asked for another one, rather than carrying a misread through the rest of the pipeline.
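The checksum scheme is defined in ICAO Doc 9303: each character gets a value (digits as themselves, letters A to Z as 10 to 35, the filler `<` as 0), the values are multiplied by the repeating weights 7, 3, 1 and summed, and the check digit is that sum modulo 10. The sketch below validates line 2 of a passport-format (TD3) MRZ; ID cards use the three-line TD1 layout with different offsets, which it doesn't cover. The function names are illustrative, not taken from the project's code.

```python
DIGITS = "0123456789"
WEIGHTS = (7, 3, 1)


def mrz_char_value(c: str) -> int:
    """ICAO 9303 character values: digits as-is, A-Z as 10-35, filler '<' as 0."""
    if c in DIGITS:
        return int(c)
    if "A" <= c <= "Z":
        return ord(c) - ord("A") + 10
    if c == "<":
        return 0
    raise ValueError(f"not an MRZ character: {c!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum with repeating weights 7, 3, 1, modulo 10."""
    return sum(mrz_char_value(c) * WEIGHTS[i % 3] for i, c in enumerate(field)) % 10


def _ok(field: str, check: str) -> bool:
    """True if `check` is the correct check digit for `field`; any stray character fails."""
    try:
        return check in DIGITS and mrz_check_digit(field) == int(check)
    except ValueError:
        return False  # e.g. OCR returned a lowercase letter or a space


def check_td3_line2(line: str) -> dict:
    """Validate every check digit on line 2 of a passport (TD3) MRZ."""
    if len(line) != 44:
        raise ValueError("TD3 line 2 must be 44 characters")
    optional, opt_check = line[28:42], line[42]
    return {
        "document_number": _ok(line[0:9], line[9]),
        "birth_date": _ok(line[13:19], line[19]),
        "expiry_date": _ok(line[21:27], line[27]),
        # An unused optional field may carry a filler instead of a digit.
        "optional_data": _ok(optional, opt_check)
        or (opt_check == "<" and set(optional) <= {"<"}),
        "composite": _ok(line[0:10] + line[13:20] + line[21:43], line[43]),
    }
```

Run against the ICAO 9303 specimen line `L898902C36UTO7408122F1204159ZE184226B<<<<<10`, every check passes; changing a single character of the document number makes both that field and the composite check fail, which is exactly the signal to ask for a new photo.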

Matching the face

Before you can compare two faces, you have to find them and line them up. We used MTCNN for that. It detects the face in each image and returns the key landmarks (eyes, nose, mouth corners), which let us align both the ID portrait and the selfie to the same orientation and scale. Skip that step and you end up comparing a head tilted one way against one tilted another, where the difference in pose drowns out the difference between people.
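MTCNN returns five landmarks per face: both eyes, the nose tip and the two mouth corners. A minimal alignment sketch using only the two eye points builds a similarity transform (rotation, uniform scale and translation) that levels the eye line and puts both eyes at fixed positions. The 112-pixel crop and the target eye positions (40% of the way down, 35% of the width apart) are assumptions, and the function names are hypothetical.

```python
import math


def eye_alignment_matrix(left_eye, right_eye, out_size=112, eye_y=0.40, eye_dist=0.35):
    """2x3 similarity transform mapping the eye centers to fixed spots in an out_size crop."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    dx, dy = rx - lx, ry - ly
    angle = math.atan2(dy, dx)                        # tilt of the eye line
    scale = eye_dist * out_size / math.hypot(dx, dy)  # normalize the inter-eye distance
    a = math.cos(-angle) * scale                      # rotate by -angle to level the eyes
    b = math.sin(-angle) * scale
    cx, cy = (lx + rx) / 2, (ly + ry) / 2             # midpoint between the eyes
    tx = out_size / 2 - (a * cx - b * cy)             # move the midpoint to the crop center
    ty = eye_y * out_size - (b * cx + a * cy)
    return [[a, -b, tx], [b, a, ty]]


def apply_affine(m, point):
    """Apply a 2x3 affine matrix to one (x, y) point."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])
```

Applying the same transform to both the ID portrait and the selfie puts them in one shared frame. Converted to an `np.float32` array, the matrix is in the form `cv2.warpAffine(image, M, (112, 112))` expects for producing the aligned crop.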

The matching itself ran on an autoencoder we trained ourselves. Instead of comparing raw pixels, it learns to squeeze a face down to a compact representation and rebuild it, and that compressed form is what we compared across the two images. Two photos of the same person land close together in that space even when the lighting, camera, and age don't line up; two different people land apart. It wasn't the only way to tackle face matching, but training our own gave us something we understood end to end and could tune for exactly this comparison.
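The article doesn't describe the network, so the following is only a toy PyTorch sketch of the idea: an encoder that squeezes an aligned face crop into a compact code, a decoder that rebuilds the image from it, and a comparison on the codes. The layer sizes, the 128-dimensional code, the MSE reconstruction loss and the use of cosine similarity are all assumptions. A plain reconstruction loss like this one only pushes the code to keep what's needed to redraw the image; on its own it doesn't guarantee that two photos of the same person land closer together than photos of different people, and the sketch doesn't reproduce however the original model was trained to do that.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FaceAutoencoder(nn.Module):
    """Toy convolutional autoencoder for aligned 3 x 112 x 112 face crops (hypothetical sizes)."""

    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 112 -> 56
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 56 -> 28
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 28 -> 14
            nn.Flatten(),
            nn.Linear(128 * 14 * 14, latent_dim),                   # compact face code
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 14 * 14),
            nn.Unflatten(1, (128, 14, 14)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 14 -> 28
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # 28 -> 56
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),  # 56 -> 112
        )

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)
        return self.decoder(z), z


def train_step(model, optimizer, batch):
    """One reconstruction step: learn to rebuild the faces from their compact codes."""
    model.train()
    recon, _ = model(batch)
    loss = F.mse_loss(recon, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def face_similarity(model, id_face, selfie) -> float:
    """Cosine similarity between the codes of two aligned 3 x 112 x 112 crops."""
    model.eval()
    _, z = model(torch.stack([id_face, selfie]))
    return F.cosine_similarity(z[0:1], z[1:2]).item()
```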

Proving liveness

We checked for a live person two different ways, since the easy attacks and the harder ones call for different defenses.

The first is an active challenge. The app asks you to turn your head, looking up, then down, then left and right, and checks that you actually do it in the order it asked. A still photo can't follow instructions, and a pre-recorded clip won't match a prompt it had no way of knowing in advance. It's a simple check that quietly rules out a lot.
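A minimal sketch of the order check, assuming some head-pose estimator (the article doesn't say how pose was measured) gives a yaw and pitch angle in degrees for each frame, with negative yaw meaning a turn to the left and positive pitch meaning a look upward. It also assumes the prompt order is shuffled per session, which is one way to make a pre-recorded clip useless; the article lists the directions but doesn't say whether their order varies. The 15-degree threshold and all names are hypothetical.

```python
import random

DIRECTIONS = ("up", "down", "left", "right")


def make_challenge(rng=None):
    """Return the four directions in a random order, so a recording can't anticipate it."""
    rng = rng or random.Random()
    return rng.sample(DIRECTIONS, k=len(DIRECTIONS))


def classify_pose(yaw, pitch, threshold=15.0):
    """Map one frame's head pose (degrees) to a direction, or None if roughly frontal."""
    if abs(yaw) < threshold and abs(pitch) < threshold:
        return None
    if abs(yaw) >= abs(pitch):
        return "left" if yaw < 0 else "right"
    return "up" if pitch > 0 else "down"


def observed_sequence(poses, threshold=15.0):
    """Collapse per-frame poses into the ordered list of distinct movements made."""
    seq = []
    for yaw, pitch in poses:
        d = classify_pose(yaw, pitch, threshold)
        if d is not None and (not seq or seq[-1] != d):
            seq.append(d)
    return seq


def passes_challenge(challenge, poses, threshold=15.0):
    """Pass only if the movements happened in exactly the prompted order."""
    return observed_sequence(poses, threshold) == list(challenge)
```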

The second is passive and asks nothing of the user. We used MiniFASNet, a lightweight anti-spoofing model that looks at a single frame and decides whether it's a real face or a stand-in for one: a photo, a screen, a printed mask. It keys on the small giveaways, like the flatness of a print or the moiré and glare coming off a display. Running both together meant an attacker had to beat a behavioral test and a visual one at the same time.
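The article says only that both checks had to pass. A minimal sketch of one way to combine them, assuming the passive model returns per-frame class probabilities: require the active challenge to have passed, average the live-class probability over a few frames, and send in-between scores to review instead of forcing an answer. Which output index means "live" depends on how the passive model was trained, so `REAL_CLASS = 1` is an assumption, as are the thresholds and outcome labels.

```python
from statistics import mean

REAL_CLASS = 1  # assumption: index of the "live" class in the passive model's output


def passive_live_score(frame_probs):
    """Average the live-class probability over several frames to smooth single-frame noise."""
    return mean(p[REAL_CLASS] for p in frame_probs)


def liveness_decision(active_passed, frame_probs, accept=0.90, reject=0.50):
    """Both checks must pass; an unsure passive score is deferred rather than guessed."""
    if not active_passed:
        return "retry"   # the user can redo the head-turn prompt
    score = passive_live_score(frame_probs)
    if score >= accept:
        return "live"
    if score < reject:
        return "reject"
    return "review"
```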

Architecture

The models were built in PyTorch and served behind a FastAPI backend, with a Next.js frontend running the capture flow and the prompts. We ran everything in containers on our own GPU-enabled hardware, on premises rather than in the cloud, which kept the image data and the models together in one controlled place while we worked with them.

Results

We built a complete, working identity-verification system, end to end. It takes an applicant from a photo of their ID to a decision in about fifteen to thirty seconds: it reads the document, matches the face against the portrait on it, and confirms there's a live person on the other side of the camera. Across close to a hundred end-to-end runs it held together, clearing the straightforward cases on its own and routing the rest to a person. The same check had been taking the insurer a day or two by hand.