SocialPulse — Production Architecture Blueprint

PHASE 1 — START HERE

Twitter / X first: the highest-signal, lowest-friction wedge

X is the correct first platform for a Kenyan listening product. Kenyans on Twitter (KOT) is the loudest, most text-dense, most quote-able public square in the country — and X is the only one of the four with a usable, paid, official firehose-style API. Nail the end-to-end pipeline on X, then clone the pattern to the harder platforms.

Why X de-risks everything downstream

Public-by-default & text-first — no login wall on the data model itself; the payload is the text you actually need for NLP, not a video you must transcribe.
An official paid API exists — you can be ToS-compliant from day one and defer the legal/proxy complexity that FB/IG/TikTok force on you.
Richest geo & entity signal — bios, place tags, and KOT's habit of @-mentioning brands (Naivas, Quickmart, Carrefour) make entity extraction and Share-of-Voice work immediately.
It validates the whole spine — auth → keyword rule → ingest → normalize → dedup → classify → store → dashboard. Once that spine works for X, adding a platform is "write one adapter."

The X ingestion decision (production posture)

X's official API was repriced into tiers. As of 2025 the practical options for a startup:

Path	What you get	Cost / limits	Verdict for production
X API — Basic	App-level read, recent search (~7-day), modest monthly post cap	~$200/mo, ~10–15k posts/mo read	MVP only — caps too low for real listening
X API — Pro	Full-archive search, higher caps, filtered stream	~$5,000/mo	Compliant production path once revenue justifies it
Managed providers (Apify, Bright Data, TwitterAPI.io-style)	Search + profile + timeline via their infra; they absorb proxy/ban risk	Usage-priced (per 1k results)	Pragmatic bridge: real coverage at MVP cost
Self-hosted stealth scraping	Max coverage, min data cost	Proxy + maintenance + ban risk	ToS-violating & fragile — avoid as primary

Recommended hybrid: Build the ingestion worker behind a SourceAdapter interface so the same normalization/dedup/NLP pipeline runs regardless of where bytes come from. Start on a managed provider (fast, affordable coverage), keep the official X API adapter ready behind a feature flag, and flip to X API Pro for the streams that need ToS guarantees. You never rewrite the pipeline — you only swap the adapter.

Legal reality, stated plainly: scraping Meta, X, and TikTok violates their Terms of Service, and in some jurisdictions raises CFAA-style and data-protection questions (Kenya's Data Protection Act, 2019 applies once you store personal data of Kenyan residents). The compliant spine is: prefer official APIs, use reputable managed providers for gaps, store only public data, honor deletion, and register as a data controller where required. Treat self-hosted stealth scraping as a last resort with eyes open.

Phase-1 build order (X only)

Stand up auth + tenancy + one keyword_rule table (Section 2 schema).

Ship the XAdapter: takes a rule, returns raw posts. Provider-backed first.

Normalize → unified mentions row; dedup on content hash + platform id.

Geo-resolve (place tag → bio → mention text → tenant default) into geo JSONB.

Async NLP: sentiment + category + entities (Section 3). Write back to the row.

Dashboard: SoV, sentiment time-series, entity matrix — all from one table.

The starter worker that does steps 2–4 ships alongside this blueprint as x_ingestor.py.

1Scraper & Data Ingestion Architecture

A resilient ingestion tier is an adapter pattern behind a queue, not a pile of scrapers. Each platform is one adapter implementing a common contract; everything after the adapter is platform-agnostic.

Keyword rules
(per tenant)

→

Scheduler
(Celery beat)

→

Source Adapters
X · FB · IG · TikTok

→

Raw queue
(Kafka/SQS)

→

Normalizer
+ dedup

→

NLP queue

→

Postgres +
pgvector

1a · Official APIs vs stealth automation vs managed providers

Platform	Official API	Reality	Production choice
X / Twitter	Yes (Basic→Pro→Enterprise)	Best official option of the four; archive search behind Pro	API (Pro) or managed provider
Facebook	Graph API — only your own Pages/owned assets	No general public-post search since the old Public Feed/CrowdTangle wind-down (Meta Content Library is research-gated)	Meta Content Library if eligible; else managed provider for public Pages
Instagram	Graph API — business/creator accounts you manage + hashtag search (limited)	Hashtag Search API gives recent public media for a hashtag, with tight rate caps	Graph hashtag API + managed provider for breadth
TikTok	Research API (gated) + Business/Display API	Research API is approval-only and region-limited; otherwise no clean public search	Managed provider; Research API if you qualify

Net: only X gives you a clean pay-to-play public firehose. For the other three, a reputable managed provider is the realistic production source, with official APIs layered in where eligibility allows. Build for this asymmetry — don't assume parity across platforms.

1b · Resilience patterns (assume everything throttles and breaks)

Adapter contract: every adapter implements fetch(rule, cursor) → (RawItem[], next_cursor). The orchestrator never knows whether bytes came from an API or a provider.
Token-bucket rate limiting per source per credential (Redis), so one tenant's heavy rule can't exhaust a shared quota. Respect x-rate-limit-remaining/reset headers and back off proactively.
Exponential backoff + jitter + circuit breaker per adapter. On repeated 429/403, open the breaker, park the rule, alert.
Cursor checkpointing (since_id / pagination cursor persisted per rule) so a crash resumes instead of re-pulling.
Credential/proxy pool rotation for provider/scraper paths; health-check and quarantine bad IPs/sessions.
Idempotent ingest — dedup key (platform + native_post_id) means re-delivery is harmless. Critical for at-least-once queues.
Raw-payload landing zone — store the untouched JSON in cheap object storage (S3/R2) before normalization, so you can replay/re-parse when a schema changes without re-scraping.
Two ingest modes: streaming (X filtered stream for real-time) and polling (scheduled search for platforms without streams + historical backfill).

1c · Inferring location when the user never states it

Explicit geo is rare. Resolve location with a confidence-scored cascade, stored as structured geo with a geo_confidence field so the UI can filter on trust level:

Tier	Signal	Confidence
1	Native geotag / place object (X place, IG location) → reverse-geocode	High (0.9+)
2	Profile `location` field, parsed & gazetteer-matched ("Nai", "254", "Nrb" → Nairobi)	Medium (0.6–0.8)
3	In-text place mentions via NER + a Kenya gazetteer (Westlands, CBD, Kasarani, Mombasa…)	Medium (0.5–0.7)
4	Language/slang & currency cues (Sheng, "Ksh", "bob", KOT) → country-level prior	Low (0.3–0.4)
5	Network graph (who they reply to / who follows them) → cluster centroid	Low, but improves with volume

Build a Kenya gazetteer (counties, towns, neighborhoods, malls, landmarks, common misspellings & Sheng aliases) as a Postgres table with trigram + full-text indexes. Combine signals in a small scoring function; never overwrite a high-confidence native tag with a low-confidence inference. Expose geo_confidence as a dashboard filter so a user can choose "Nairobi (confirmed)" vs "Nairobi (inferred)."

2Data Pipeline & Normalization Schema

A Tweet, a TikTok caption, and a Facebook post are the same abstract object: an authored public utterance, at a time, maybe a place, that mentions things and carries engagement. Model that abstraction once; push platform quirks into JSONB.

2a · Mapping disparate models to one canonical "Mention"

Canonical field	X / Tweet	TikTok	Facebook / IG
`content`	tweet text	video description + caption	post message / caption
`author`	user object	author	page/profile
`published_at`	created_at	createTime	created_time
`engagement`	like/retweet/reply/quote	digg/share/comment/play	reactions/shares/comments
`media`	media entities	video URL + cover	attachments
`geo`	place / coordinates	(rare) region code	place
`raw`	the entire untouched provider/API payload (JSONB) — your insurance policy

Normalization rule of thumb: promote to a typed column only what you filter, sort, or join on (tenant, platform, time, geo, sentiment, category). Everything else lives in raw JSONB and gets pulled forward later if it earns a query.

2b · PostgreSQL schema (JSONB + pgvector)

Postgres over Mongo here: you need multi-tenant relational integrity (tenants↔rules↔mentions), and JSONB flexibility, and vector search for semantic dedup/RAG — all in one engine via pgvector. Partition the hot table by time.

-- Tenancy ----------------------------------------------------------
CREATE TABLE tenants (
  id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  name         text NOT NULL,
  plan         text DEFAULT 'mvp',
  created_at   timestamptz DEFAULT now()
);

CREATE TABLE users (
  id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id    uuid REFERENCES tenants(id) ON DELETE CASCADE,
  email        citext UNIQUE NOT NULL,
  role         text DEFAULT 'member'
);

-- What each tenant is listening for --------------------------------
CREATE TABLE keyword_rules (
  id           uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  tenant_id    uuid REFERENCES tenants(id) ON DELETE CASCADE,
  label        text,                          -- "Unga prices"
  terms        text[] NOT NULL,            -- ['unga','bei ya unga','maize flour']
  platforms    text[] DEFAULT '{x}',        -- start with X
  geo_filter   jsonb,                         -- {"country":"KE","city":"Nairobi"}
  cursors      jsonb DEFAULT '{}',           -- per-platform since_id / pagination
  is_active    boolean DEFAULT true,
  created_at   timestamptz DEFAULT now()
);

-- The canonical, platform-agnostic record --------------------------
CREATE TABLE mentions (
  id              uuid DEFAULT gen_random_uuid(),
  tenant_id       uuid NOT NULL,
  rule_id         uuid,
  platform        text NOT NULL,          -- x | facebook | instagram | tiktok
  native_post_id  text NOT NULL,          -- id on that platform
  content_hash    text NOT NULL,          -- sha256(normalized text) for dedup
  content         text,
  lang            text,
  author          jsonb,                     -- {handle, display, followers, verified}
  engagement      jsonb,                     -- {likes, shares, comments, views}
  media           jsonb,
  geo             jsonb,                     -- {country, city, lat, lng}
  geo_confidence  real DEFAULT 0,
  sentiment       real,                      -- -1..1
  sentiment_label text,                      -- pos | neu | neg
  categories      text[],                    -- ['RetailPricing','ConsumerSentiment']
  entities        jsonb,                     -- [{name:'Naivas',type:'BRAND',span:[..]}]
  embedding       vector(768),               -- pgvector: semantic dedup + RAG
  raw             jsonb,                     -- untouched source payload
  published_at    timestamptz NOT NULL,
  ingested_at     timestamptz DEFAULT now(),
  PRIMARY KEY (id, published_at)
) PARTITION BY RANGE (published_at);

-- Dedup + the indexes that make the dashboard fast -----------------
CREATE UNIQUE INDEX ux_mentions_dedup
  ON mentions (tenant_id, platform, native_post_id, published_at);
CREATE INDEX ix_mentions_tenant_time ON mentions (tenant_id, published_at DESC);
CREATE INDEX ix_mentions_categories  ON mentions USING GIN (categories);
CREATE INDEX ix_mentions_entities    ON mentions USING GIN (entities jsonb_path_ops);
CREATE INDEX ix_mentions_geo         ON mentions USING GIN (geo);
CREATE INDEX ix_mentions_embedding   ON mentions
  USING hnsw (embedding vector_cosine_ops);

Multi-tenant isolation: enable Postgres Row-Level Security — CREATE POLICY tenant_isolation ON mentions USING (tenant_id = current_setting('app.tenant_id')::uuid). The app sets app.tenant_id per request, so a query bug can't leak one customer's data into another's dashboard.

2c · Deduplication (two layers)

Exact: the unique index on (tenant, platform, native_post_id) kills re-delivery. content_hash catches cross-rule duplicates of the same post.
Near-duplicate / repost: the same claim retweeted/reworded thousands of times shouldn't count as thousands of independent signals. Use the embedding with an HNSW cosine threshold to cluster near-dupes into a thread/cluster_id, so Share-of-Voice counts distinct narratives, not raw volume.

3AI Categorization & Sentiment Engine

A tiered pipeline: cheap deterministic filters first, small fast models in the middle, an LLM only where ambiguity actually requires it. Running an LLM on every post is how you go bankrupt.

Clean & lang-detect

→

Rules / regex
(prices, $brands)

→

Embed
(multilingual)

→

Fast classifier
sentiment + category

→

LLM fallback
(low-confidence only)

→

Write back

3a · The NLP pipeline, stage by stage

Preprocess: strip tracking params, normalize unicode/emoji, expand handles & hashtags into tokens, language-detect (fastText/CLD3). Keep the original; transform a copy.
Deterministic extractors (free & precise): regex for prices (Ksh\s?\d[\d,]*, "bob", "/=", "k" for thousands), promo/discount terms, and a brand dictionary for the entities you already know you care about (Naivas, Quickmart, Carrefour, Chandarana, Cleanshelf…).
Embeddings: a multilingual model (e.g. bge-m3 / multilingual-e5) that handles English + Swahili + code-switching, stored in pgvector. Powers dedup, semantic search, and zero-shot categorization.
Categorization: embedding + zero-shot/few-shot against category prototype vectors → cheap, no per-call LLM cost. Falls back to an LLM only when top-2 categories are close.
Sentiment: a fine-tuned transformer (start from a multilingual/AfriBERTa-style base) on labeled Kenyan data; LLM adjudicates low-confidence and sarcasm-prone cases.
Entity extraction & linking: NER → link surface forms ("Naivas", "naivas supermarket", "@NaivasKenya") to a canonical entity in an entities table, so comparisons aggregate correctly.

3b · Auto-tagging into #RetailPricing, #LocalEvents, #ConsumerSentiment

Define each category as (a) a short natural-language definition, (b) a handful of labeled examples, and (c) a prototype embedding (mean of example embeddings). At ingest:

Cosine-match the post embedding to every prototype; assign categories above threshold (multi-label — a post can be both pricing and sentiment).
If the best matches are ambiguous, send only that post to an LLM with the category definitions for a final multi-label decision.
Periodically retrain prototypes from human-corrected labels (active learning) so the taxonomy sharpens over time.

#RetailPricing#LocalEvents#ConsumerSentiment #ProductQuality#CustomerService#Promotions#Availability

3c · Kenyan context: Sheng, code-switching, sarcasm

Off-the-shelf English sentiment models fail badly on KOT. Concrete handling:

Sheng/Swahili lexicon & normalizer: a maintained map — "fiti/poa/noma" (good), "mbaya/fake" (bad), "bei ya unga" (price of maize flour, a politically loaded cost-of-living signal), "bob/ganji/doo" (money), "unga" (flour/cash slang). Normalize before the model sees the text.
KOT & cultural markers: "KOT" as a community signal; recognize ratio/drag patterns, "wue", "ata", "si" intensifiers, and emoji-as-sentiment (🤣 can flip polarity).
Code-switching: never assume one language per post. Multilingual embeddings + a model fine-tuned on mixed English-Swahili-Sheng is non-negotiable.
Sarcasm & negation: the hardest class. Route low-confidence / sarcasm-flagged posts to the LLM with few-shot Kenyan examples; bank the corrections as new training data.
Aspect-based sentiment: "Naivas prices up but Carrefour parking is hell" is negative-price/Naivas + negative-experience/Carrefour. Tie sentiment to the entity+aspect, not the whole post — this is what makes competitive comparison trustworthy.

Build a labeling flywheel early. Your durable moat isn't the model — it's a growing, human-verified corpus of Kenyan social text with sentiment/category/entity labels. Capture every analyst correction in the dashboard and feed it back into fine-tuning.

4High-Level Reporting & Comparison Logic

Reporting is aggregation over the canonical table, sliced by entity, category, geo, and time. Because sentiment is stored per entity+aspect, competitive matrices are a GROUP BY — not a second pipeline.

4a · Price comparison & competitive matrices (Naivas vs Carrefour)

Filter to comparable mentions: posts where categories @> '{RetailPricing}' and entities contains the target brands.
Extract structured prices from the deterministic price-regex output, attaching each price to (entity, product/aspect) — e.g. (Naivas, 2kg unga, Ksh 210).
Normalize units (per kg, per litre) so a 2kg vs 1kg quote is comparable; reject outliers via IQR to kill typos/jokes.
Aggregate to a matrix: rows = products, columns = brands, cells = median price + sample size + 7-day trend + crowd-sentiment.
Confidence: every cell carries n and a freshness timestamp — never show a single noisy post as a "price."

-- Share of Voice between competing brands, last 30 days, Nairobi
SELECT e->>'name' AS brand,
       count(*) AS mentions,
       round(avg(sentiment)::numeric, 3) AS avg_sentiment
FROM mentions m, jsonb_array_elements(m.entities) e
WHERE m.tenant_id = current_setting('app.tenant_id')::uuid
  AND m.published_at > now() - interval '30 days'
  AND m.geo->>'city' = 'Nairobi'
  AND e->>'name' = ANY(ARRAY['Naivas','Carrefour','Quickmart'])
GROUP BY brand ORDER BY mentions DESC;

4b · Visualization & aggregation

Share of Voice — stacked area / 100% bar of mentions per brand over time; count distinct narrative clusters, not raw reposts.
Sentiment time-series — rolling daily mean ±band per entity; annotate spikes with the driving cluster (the "why").
Price index — median price line per brand per product, with crowd-sentiment overlaid.
Geo heatmap — county/neighborhood intensity, gated by geo_confidence.
Topic/entity co-occurrence graph — which brands get mentioned together, and in what sentiment.

Pre-aggregate for speed. Don't scan raw mentions for every dashboard load. Roll up into hourly/daily materialized views (or TimescaleDB continuous aggregates) keyed by (tenant, entity, category, geo, bucket). Dashboards read the rollups; the raw table is for drill-down only.

5Tech Stack & Roadmap

Scalable, budget-conscious, boring-on-purpose. Every choice favors one operational surface and a clear upgrade path over premature microservices.

Frontend

Next.js (React) + TypeScript, Tailwind, deployed on Vercel/Cloudflare Pages.
Recharts / ECharts for time-series & SoV; deck.gl/Mapbox for geo heatmaps.
Auth UI via Clerk/Auth.js; org switcher = tenant switcher.

Backend / API

FastAPI (Python) — same language as the ML stack; async; great for the NLP services.
REST + WebSocket (live mention feed); JWT scoped to tenant.
Pydantic models mirror the canonical schema.

Workers / Queue

Celery + Redis for MVP (ingest, normalize, NLP, rollups). Beat for schedules.
Upgrade path: Kafka / Redpanda when ingest volume outgrows Redis as a broker.
Separate queues per stage so NLP backpressure never stalls ingest.

Data

PostgreSQL + pgvector (one engine: relational + JSONB + vectors). Add TimescaleDB for time partitioning & continuous aggregates.
S3 / Cloudflare R2 raw-payload landing zone.
Redis for rate-limit buckets, cache, dedup bloom filters.

AI / LLM orchestration

Embeddings: bge-m3 / multilingual-e5 self-hosted (cheap at volume).
Sentiment/category: fine-tuned transformer served via vLLM / HF TGI.
LLM fallback: hosted API (Claude/GPT) behind a router; cache by content hash; cap spend per tenant.
Vector store = pgvector (no extra service until you need one).

Infra / Ops

Docker + a managed container host (Fly.io/Render for MVP → ECS/GKE at scale).
Managed Postgres (Neon/Supabase/RDS) so you don't babysit a DB early.
Observability: OpenTelemetry → Grafana; Sentry for errors; per-adapter health dashboards.

Phased roadmap

Weeks 1–4 — X spine. Auth + tenancy + rules + XAdapter + normalize/dedup + basic sentiment + a dashboard with SoV & sentiment time-series for one platform.

Weeks 5–8 — Intelligence. Entity linking, category prototypes, Kenyan sentiment fine-tune, price extraction, comparison matrix, geo cascade + gazetteer.

Weeks 9–14 — Breadth. Add TikTok & IG via managed-provider adapters (same pipeline), then Facebook Pages. Continuous aggregates + rollup dashboards.

Ongoing — Hardening. Circuit breakers, RLS audit, labeling flywheel, cost caps, alerting, and the X API Pro cutover when revenue justifies ToS-guaranteed streams.