Spring Boot Interview Question — Building an Ad Impression Platform

Why raw event storage alone breaks at scale

Suraj Mishra

Jun 02, 2026

Scenario

You are building an ad analytics platform similar to Google Ads or Meta Ads.

When an advertisement is rendered on a publisher page, your system records an impression.

Advertisers are billed based on impressions and clicks.

Current flow:

Browser
   ↓
Ad Service
   ↓
Impression API
   ↓
PostgreSQL
   ↓ (CDC)
Kafka
   ↓
Analytics/Billing Systems

When an ad is displayed, the frontend sends:

Payload:

{
  "adId": "AD123",
  "campaignId": "CMP9",
  "userId": "USER42",
  "timestamp": 1717000000
}

The system records:

impression count
advertiser billing
campaign analytics

Production Incident

Advertisers complain:

We are being charged for duplicate impressions.

Investigation reveals:

mobile clients retry requests
network failures cause duplicate sends
CDN occasionally retries requests
Kafka consumers replay events during failures

Result:

Same impression counted multiple times.

API layer implementation

How would you redesign this system so that

Each impression is counted exactly once
Discuss the data model for that

It’s hard to enforce exactly once across

HTTP retries
DB writes
CDC
Kafka consumers

So we must define:

Single source of truth = PostgreSQL
Everything else = derived, replayable projections

For idempotency , we need a globally unique identifier such as impression_id , which should be generated by ad_service which is own by us.

This is:

created when ad is served
returned to browser
sent back in impression API

Then at the write layer, we can enforce constraint on the data model to keep the impression_id unique which should prevent any duplicate entry to our table.

Data model:

CREATE TABLE impressions (                    
    impression_id UUID PRIMARY KEY,
    ad_id UUID NOT NULL,
    campaign_id UUID NOT NULL,
    user_id UUID NOT NULL,

    created_at TIMESTAMP NOT NULL
);

The primary key enforces idempotency at DB level.

Now with this we at least ensure our source of truth do not contains duplicates.

But with check at DB level pushes something very late in the stack which we can prevent early in request. We can have distributed cache like redis where we can store impression_id as key to check if they already exist in defined TTL and skip upon existence. This is good design and help us in two things:

Early termination of request
Prevent executing query on database that should be failing anyway.

High Level Design

📢 Get actionable Java and Spring Boot insights every week, including practical code tips and real-world, use-case-based interview questions, to help you level up your backend skills—join 7300+ subscribers for hand-crafted, no-fluff content.

First 100 paid subscribers will get the annual membership at $50/year forever that is ~ $4/mo ( 92 already converted to paid, 8 remaining)

So far we have covered 60 real world based interview questions and will add up to 100 by end of this year.

Testimonials

How do you handle idempotency in stream processing + analytical layer to prevent double charging?

Kafka is treated as:

at-least-once event delivery system

So duplicates may still occur. We do not rely on Kafka for correctness.

We should implement idempotency at stream processing and analytical layer.

Stream processing layer like flink, beam can maintain store that can store see impression_id for defined TTL.

if impressionId not in state:
    emit event downstream
    mark as seen
else:
    drop

partitioned by impressionId (keyBy)
stored locally per worker
TTL applied (e.g., 1 hour)

So we get deduplicated event stream for analytics.

Flink Java implementation:

Even idempotency at this layer cannot guarantee or prevent duplicates fully. We can still have worker failed, retry outside of TTL or late arrival data which essentially skips seen impression_id. Hence we should also make sure our analytical query perfrom uniqness/distinct count over impression_id when computing the metrics.

SELECT
  campaign_id,
  COUNT(DISTINCT impression_id)
FROM impressions
GROUP BY campaign_id;

Why not just store raw events and compute everything using SQL in Analytical Database like BigQuery?

We could store:

impression events → BigQuery (raw table)

Then compute everything:

SELECT campaign_id, COUNT(DISTINCT impression_id)
FROM impressions
GROUP BY campaign_id;

So functionally It works.

Query cost becomes explosive

In systems like Google BigQuery:

cost is based on bytes scanned
raw event tables grow to billions–trillions of rows

So every query becomes:

scan massive dataset → group → aggregate

Even simple dashboards become expensive and unpredictable.

Repeated computation of the same logic

Every team writes:

CTR calculation
impression counting
dedup logic
filtering rules

Example:

COUNT(*)
COUNT(DISTINCT impression_id)
filtered COUNT(*)

So the same business logic is recomputed everywhere.

Latency for dashboards

Raw computation means:

scan TBs → aggregate → return result

This is too slow for:

real-time dashboards
advertiser portals
fraud monitoring

These issues reveals deeper org level inefficiency.

Raw event querying pushes all complexity to query time:

storage is simple
computation is repeated everywhere

This does not scale organizationally.

Overall Design

Thats all for this week friends! Thanks for reading this far. If you liked it please share with your network.

Before you leave, don’t forget to unlock all the real world use case based interview questions , just at $4/mo (with annual subscription)

Happy Coding 🚀
Suraj

Subscribe | Sponsor us | LinkedIn | Twitter

Engineering With Java

Discussion about this post

Ready for more?