A production-scale, multi-tenant blueprint for ingesting, normalizing, and analyzing public conversation across X, Facebook, Instagram, and TikTok — with geo-filtering, autonomous categorization, and entity-level competitive comparison. Tuned for the Kenyan market.
X is the correct first platform for a Kenyan listening product. Kenyans on Twitter (KOT) is the loudest, most text-dense, most quote-able public square in the country — and X is the only one of the four with a usable, paid, official firehose-style API. Nail the end-to-end pipeline on X, then clone the pattern to the harder platforms.
X's official API was repriced into tiers. As of 2025 the practical options for a startup:
| Path | What you get | Cost / limits | Verdict for production |
|---|---|---|---|
| X API — Basic | App-level read, recent search (~7-day), modest monthly post cap | ~$200/mo, ~10–15k posts/mo read | MVP only — caps too low for real listening |
| X API — Pro | Full-archive search, higher caps, filtered stream | ~$5,000/mo | Compliant production path once revenue justifies it |
| Managed providers (Apify, Bright Data, TwitterAPI.io-style) | Search + profile + timeline via their infra; they absorb proxy/ban risk | Usage-priced (per 1k results) | Pragmatic bridge: real coverage at MVP cost |
| Self-hosted stealth scraping | Max coverage, min data cost | Proxy + maintenance + ban risk | ToS-violating & fragile — avoid as primary |
SourceAdapter interface so the same normalization/dedup/NLP pipeline runs regardless of where bytes come from. Start on a managed provider (fast, affordable coverage), keep the official X API adapter ready behind a feature flag, and flip to X API Pro for the streams that need ToS guarantees. You never rewrite the pipeline — you only swap the adapter.keyword_rule table (Section 2 schema).XAdapter: takes a rule, returns raw posts. Provider-backed first.mentions row; dedup on content hash + platform id.geo JSONB.The starter worker that does steps 2–4 ships alongside this blueprint as x_ingestor.py.
A resilient ingestion tier is an adapter pattern behind a queue, not a pile of scrapers. Each platform is one adapter implementing a common contract; everything after the adapter is platform-agnostic.
| Platform | Official API | Reality | Production choice |
|---|---|---|---|
| X / Twitter | Yes (Basic→Pro→Enterprise) | Best official option of the four; archive search behind Pro | API (Pro) or managed provider |
| Graph API — only your own Pages/owned assets | No general public-post search since the old Public Feed/CrowdTangle wind-down (Meta Content Library is research-gated) | Meta Content Library if eligible; else managed provider for public Pages | |
| Graph API — business/creator accounts you manage + hashtag search (limited) | Hashtag Search API gives recent public media for a hashtag, with tight rate caps | Graph hashtag API + managed provider for breadth | |
| TikTok | Research API (gated) + Business/Display API | Research API is approval-only and region-limited; otherwise no clean public search | Managed provider; Research API if you qualify |
Net: only X gives you a clean pay-to-play public firehose. For the other three, a reputable managed provider is the realistic production source, with official APIs layered in where eligibility allows. Build for this asymmetry — don't assume parity across platforms.
fetch(rule, cursor) → (RawItem[], next_cursor). The orchestrator never knows whether bytes came from an API or a provider.x-rate-limit-remaining/reset headers and back off proactively.platform + native_post_id) means re-delivery is harmless. Critical for at-least-once queues.Explicit geo is rare. Resolve location with a confidence-scored cascade, stored as structured geo with a geo_confidence field so the UI can filter on trust level:
| Tier | Signal | Confidence |
|---|---|---|
| 1 | Native geotag / place object (X place, IG location) → reverse-geocode | High (0.9+) |
| 2 | Profile location field, parsed & gazetteer-matched ("Nai", "254", "Nrb" → Nairobi) | Medium (0.6–0.8) |
| 3 | In-text place mentions via NER + a Kenya gazetteer (Westlands, CBD, Kasarani, Mombasa…) | Medium (0.5–0.7) |
| 4 | Language/slang & currency cues (Sheng, "Ksh", "bob", KOT) → country-level prior | Low (0.3–0.4) |
| 5 | Network graph (who they reply to / who follows them) → cluster centroid | Low, but improves with volume |
Build a Kenya gazetteer (counties, towns, neighborhoods, malls, landmarks, common misspellings & Sheng aliases) as a Postgres table with trigram + full-text indexes. Combine signals in a small scoring function; never overwrite a high-confidence native tag with a low-confidence inference. Expose geo_confidence as a dashboard filter so a user can choose "Nairobi (confirmed)" vs "Nairobi (inferred)."
A Tweet, a TikTok caption, and a Facebook post are the same abstract object: an authored public utterance, at a time, maybe a place, that mentions things and carries engagement. Model that abstraction once; push platform quirks into JSONB.
| Canonical field | X / Tweet | TikTok | Facebook / IG |
|---|---|---|---|
content | tweet text | video description + caption | post message / caption |
author | user object | author | page/profile |
published_at | created_at | createTime | created_time |
engagement | like/retweet/reply/quote | digg/share/comment/play | reactions/shares/comments |
media | media entities | video URL + cover | attachments |
geo | place / coordinates | (rare) region code | place |
raw | the entire untouched provider/API payload (JSONB) — your insurance policy | ||
Normalization rule of thumb: promote to a typed column only what you filter, sort, or join on (tenant, platform, time, geo, sentiment, category). Everything else lives in raw JSONB and gets pulled forward later if it earns a query.
Postgres over Mongo here: you need multi-tenant relational integrity (tenants↔rules↔mentions), and JSONB flexibility, and vector search for semantic dedup/RAG — all in one engine via pgvector. Partition the hot table by time.
-- Tenancy ---------------------------------------------------------- CREATE TABLE tenants ( id uuid PRIMARY KEY DEFAULT gen_random_uuid(), name text NOT NULL, plan text DEFAULT 'mvp', created_at timestamptz DEFAULT now() ); CREATE TABLE users ( id uuid PRIMARY KEY DEFAULT gen_random_uuid(), tenant_id uuid REFERENCES tenants(id) ON DELETE CASCADE, email citext UNIQUE NOT NULL, role text DEFAULT 'member' ); -- What each tenant is listening for -------------------------------- CREATE TABLE keyword_rules ( id uuid PRIMARY KEY DEFAULT gen_random_uuid(), tenant_id uuid REFERENCES tenants(id) ON DELETE CASCADE, label text, -- "Unga prices" terms text[] NOT NULL, -- ['unga','bei ya unga','maize flour'] platforms text[] DEFAULT '{x}', -- start with X geo_filter jsonb, -- {"country":"KE","city":"Nairobi"} cursors jsonb DEFAULT '{}', -- per-platform since_id / pagination is_active boolean DEFAULT true, created_at timestamptz DEFAULT now() ); -- The canonical, platform-agnostic record -------------------------- CREATE TABLE mentions ( id uuid DEFAULT gen_random_uuid(), tenant_id uuid NOT NULL, rule_id uuid, platform text NOT NULL, -- x | facebook | instagram | tiktok native_post_id text NOT NULL, -- id on that platform content_hash text NOT NULL, -- sha256(normalized text) for dedup content text, lang text, author jsonb, -- {handle, display, followers, verified} engagement jsonb, -- {likes, shares, comments, views} media jsonb, geo jsonb, -- {country, city, lat, lng} geo_confidence real DEFAULT 0, sentiment real, -- -1..1 sentiment_label text, -- pos | neu | neg categories text[], -- ['RetailPricing','ConsumerSentiment'] entities jsonb, -- [{name:'Naivas',type:'BRAND',span:[..]}] embedding vector(768), -- pgvector: semantic dedup + RAG raw jsonb, -- untouched source payload published_at timestamptz NOT NULL, ingested_at timestamptz DEFAULT now(), PRIMARY KEY (id, published_at) ) PARTITION BY RANGE (published_at); -- Dedup + the indexes that make the dashboard fast ----------------- CREATE UNIQUE INDEX ux_mentions_dedup ON mentions (tenant_id, platform, native_post_id, published_at); CREATE INDEX ix_mentions_tenant_time ON mentions (tenant_id, published_at DESC); CREATE INDEX ix_mentions_categories ON mentions USING GIN (categories); CREATE INDEX ix_mentions_entities ON mentions USING GIN (entities jsonb_path_ops); CREATE INDEX ix_mentions_geo ON mentions USING GIN (geo); CREATE INDEX ix_mentions_embedding ON mentions USING hnsw (embedding vector_cosine_ops);
CREATE POLICY tenant_isolation ON mentions USING (tenant_id = current_setting('app.tenant_id')::uuid). The app sets app.tenant_id per request, so a query bug can't leak one customer's data into another's dashboard.(tenant, platform, native_post_id) kills re-delivery. content_hash catches cross-rule duplicates of the same post.embedding with an HNSW cosine threshold to cluster near-dupes into a thread/cluster_id, so Share-of-Voice counts distinct narratives, not raw volume.A tiered pipeline: cheap deterministic filters first, small fast models in the middle, an LLM only where ambiguity actually requires it. Running an LLM on every post is how you go bankrupt.
Ksh\s?\d[\d,]*, "bob", "/=", "k" for thousands), promo/discount terms, and a brand dictionary for the entities you already know you care about (Naivas, Quickmart, Carrefour, Chandarana, Cleanshelf…).bge-m3 / multilingual-e5) that handles English + Swahili + code-switching, stored in pgvector. Powers dedup, semantic search, and zero-shot categorization.entities table, so comparisons aggregate correctly.Define each category as (a) a short natural-language definition, (b) a handful of labeled examples, and (c) a prototype embedding (mean of example embeddings). At ingest:
Off-the-shelf English sentiment models fail badly on KOT. Concrete handling:
Reporting is aggregation over the canonical table, sliced by entity, category, geo, and time. Because sentiment is stored per entity+aspect, competitive matrices are a GROUP BY — not a second pipeline.
categories @> '{RetailPricing}' and entities contains the target brands.n and a freshness timestamp — never show a single noisy post as a "price."-- Share of Voice between competing brands, last 30 days, Nairobi SELECT e->>'name' AS brand, count(*) AS mentions, round(avg(sentiment)::numeric, 3) AS avg_sentiment FROM mentions m, jsonb_array_elements(m.entities) e WHERE m.tenant_id = current_setting('app.tenant_id')::uuid AND m.published_at > now() - interval '30 days' AND m.geo->>'city' = 'Nairobi' AND e->>'name' = ANY(ARRAY['Naivas','Carrefour','Quickmart']) GROUP BY brand ORDER BY mentions DESC;
geo_confidence.mentions for every dashboard load. Roll up into hourly/daily materialized views (or TimescaleDB continuous aggregates) keyed by (tenant, entity, category, geo, bucket). Dashboards read the rollups; the raw table is for drill-down only.Scalable, budget-conscious, boring-on-purpose. Every choice favors one operational surface and a clear upgrade path over premature microservices.
XAdapter + normalize/dedup + basic sentiment + a dashboard with SoV & sentiment time-series for one platform.