dbt_attribution_mapping — v1

Back to project
ID9123c1a5-2b8c-4851-ae0c-c88812c93757
Lensdbt_attribution_mapping
Functionassess
Phasediagnosis
Topicsattribution, dbt
Versionv1
Created2026-04-19T10:23:32+00:00

Content

{
  "layers": [
    {
      "name": "identity_resolution",
      "status": "complete",
      "models": [
        "int_identity__resolution",
        "int_person_id__to_distinct_id",
        "int_posthog__person_collapsed",
        "int_user_id__to_distinct_id",
        "int_user_id__to_person_id",
        "canonical_users",
        "stg_kilocode__user_identity",
        "stg_posthog__person_identity"
      ],
      "capabilities_present": [
        "identity stitching",
        "cross-session linking",
        "merged ID handling",
        "deduplication"
      ],
      "capabilities_missing": [
        "device graph"
      ],
      "evidence": "int_identity__resolution implements a 7-tier matching system (strong_id > email_validated_user_id > email_validated_kilo_user_id > single_id_exists > email_only > unresolved_posthog > unresolved_kilocode) that stitches PostHog persons to canonical user_ids across two systems. int_posthog__person_collapsed deduplicates multiple PostHog persons to a single canonical user_id using MIN_BY for first-touch preservation. The three-step mapping chain (int_user_id__to_person_id \u2192 int_person_id__to_distinct_id \u2192 int_user_id__to_distinct_id) provides cross-session linking by mapping canonical user_ids through person_ids to all PostHog distinct_ids. canonical_users serves as the single source of truth, combining identity resolution with attribution data. int_person_id__to_distinct_id is described as 'device graph mapping' but it maps person_id to distinct_id rather than linking multiple physical devices to a single user via fingerprinting or device-level signals.",
      "notes": "Identity resolution is well-implemented for the two-system (PostHog + Kilocode backend) architecture. The 7-tier matching provides graceful degradation from strong matches to unresolved records. However, there is no true device graph linking multiple physical devices (e.g., phone + laptop) to a single user outside of PostHog's built-in person merging. The stg_backend__stytch_fingerprints staging model exists but is not consumed by the identity resolution chain, suggesting fingerprint-based linking is available but unused."
    },
    {
      "name": "signal_touchpoint",
      "status": "partial",
      "models": [
        "int_conversion_events__posthog",
        "int_conversion_events__synthetic_daily",
        "int_posthog__user_event_firsts",
        "int_user__event_firsts",
        "int_user__website_visits",
        "int_user_id__attribution_history",
        "daily_marketing_acquisition",
        "stg_posthog__events_filtered",
        "stg_posthog__person_identity"
      ],
      "capabilities_present": [
        "event staging",
        "touchpoint extraction",
        "data quality filters",
        "signal type classification",
        "temporal ordering"
      ],
      "capabilities_missing": [
        "session detection"
      ],
      "evidence": "stg_posthog__events_filtered provides clean event staging with deduplication on UUID and bot traffic detection, reducing scan volume by ~48%. int_conversion_events__posthog classifies events by type (user_created, first_usage, payment_succeeded, pageview) using CASE WHEN patterns \u2014 this is signal type classification and touchpoint extraction. int_user_id__attribution_history stores all touchpoints per user in long format with temporal ordering via person_version, filtering to rows where at least one attribution field is non-null. int_user__event_firsts assembles milestone timestamps across all sources (PostHog, Stripe, backend) using MIN(CASE WHEN) for first-occurrence detection \u2014 establishing temporal ordering. int_conversion_events__synthetic_daily generates synthetic retention signals with power-curve conversion values. Session detection is delegated entirely to PostHog's native session handling (posthog_sessions staging model exists as a passthrough) rather than being built in dbt with inactivity windows.",
      "notes": "The signal/touchpoint layer is strong on event classification and quality filtering but relies on PostHog for session detection rather than implementing its own session logic. The attribution_history model stores touchpoints in long format but these are derived from PostHog person property versions (initial_* fields) rather than from individual event-level touchpoints with explicit timestamps, which limits the granularity available for multi-touch attribution. Synthetic daily conversion events add valuable retention signals for ad platform optimization."
    },
    {
      "name": "channel_classification",
      "status": "partial",
      "models": [
        "canonical_users",
        "int_paid_budget__ad_spend_by_platform",
        "int_paid_budget__ad_spend_geo",
        "paid_budget__enterprise_conversions",
        "paid_budget__teams_conversions",
        "paid_budget__user_conversions",
        "paid_budget__event_conversions",
        "paid_budget__kiloclaw_conversions",
        "int_user__attribution_resolved",
        "stg_posthog__person_identity"
      ],
      "capabilities_present": [
        "UTM parsing",
        "click ID detection",
        "classification hierarchy",
        "fallback handling"
      ],
      "capabilities_missing": [
        "referrer classification",
        "user-agent heuristics"
      ],
      "evidence": "canonical_users derives paid_acquisition_channel from first-touch signals using 'CASE WHEN for channel classification from click IDs' with click ID priority over UTM fallback \u2014 this implements the classification hierarchy. stg_posthog__person_identity extracts and cleans all initial_* and current attribution properties including UTM parameters. Click ID detection covers gclid (Google), fbclid (Meta), msclkid (Bing), rdt_cid (Reddit), and li_fat_id (LinkedIn) as evidenced by daily_marketing_acquisition's 'COUNT with click ID coalesce for channel detection' and the platform-specific conversion models. int_user__attribution_resolved implements 90-day staleness windows per platform. The paid_budget conversion models (enterprise, teams, user, event, kiloclaw) all apply a consistent CASE WHEN hierarchy for channel classification from click IDs with UTM fallback. Fallback handling assigns 'Organic/Direct' rows as seen in paid_budget__user_conversions. However, no model implements referrer domain classification (mapping google.com \u2192 organic search, facebook.com \u2192 social). No model uses user-agent heuristics for classification improvement despite stg_backend__http_user_agent existing.",
      "notes": "Channel classification is heavily oriented toward paid channels via click IDs, which is appropriate for a paid budget attribution system. The classification hierarchy (click IDs > UTM > fallback) is consistently applied across all conversion models. However, the lack of referrer classification means organic traffic channels are poorly differentiated \u2014 everything without a click ID or UTM falls into a single 'Organic/Direct' bucket. This is a significant gap for understanding the full marketing mix. The stg_backend__http_user_agent table exists in staging but is not used for classification."
    },
    {
      "name": "attribution_models",
      "status": "partial",
      "models": [
        "int_user__attribution_resolved",
        "paid_budget__enterprise_conversions",
        "paid_budget__teams_conversions",
        "paid_budget__user_conversions",
        "paid_budget__event_conversions",
        "paid_budget__kiloclaw_conversions",
        "paid_budget__conversions",
        "canonical_users"
      ],
      "capabilities_present": [
        "first-touch model",
        "grain flexibility"
      ],
      "capabilities_missing": [
        "last-touch model",
        "multi-touch models",
        "model comparison",
        "survey integration"
      ],
      "evidence": "The system implements first-touch attribution exclusively. canonical_users uses 'ROW_NUMBER for best match tier selection' to select the first-touch attribution data. int_user__attribution_resolved consolidates multi-touch history into 'one canonical row per user, selecting initial click ID when within staleness window and falling back to most recent' \u2014 but the output is a single resolved row, not a multi-touch distribution. All paid_budget conversion models (enterprise, teams, user, event, kiloclaw) attribute conversions using first-touch click IDs from canonical_users. int_posthog__person_collapsed uses 'min_by(field, posthog_created_at) for first-touch attribution to match PostHog's $initial_* behavior'. Grain flexibility exists at both user level (paid_budget__user_conversions) and organization/subscription level (paid_budget__enterprise_conversions, paid_budget__teams_conversions). int_user_id__attribution_history stores multi-touch data in long format but no downstream model distributes credit across touchpoints. stg_backend__kilocode_users has a customer_source field with regex-based classification (self-reported attribution) but this is not integrated into the attribution models.",
      "notes": "The attribution system is fundamentally first-touch only, which is a deliberate design choice optimized for paid media budget allocation and ad platform conversion syncing. int_user__attribution_resolved does implement a first-touch-with-last-touch-fallback pattern (using initial click ID within 90-day window, falling back to most recent), but this is not a true last-touch model \u2014 it's a staleness recovery mechanism. The int_user_id__attribution_history model stores the raw multi-touch data that could support linear/time-decay/position-based models, but no such models are built. The customer_source field in kilocode_users (self-reported channel) exists but is not compared against behavioral attribution. No model comparison capability exists."
    },
    {
      "name": "coverage_health",
      "status": "missing",
      "models": [],
      "capabilities_present": [],
      "capabilities_missing": [
        "behavioural coverage rate",
        "direct-only rate",
        "model comparison metrics",
        "XmR process charts",
        "data quality metrics"
      ],
      "evidence": "No model in the inventory calculates attribution coverage rates, direct-only rates, or any statistical process control metrics for attribution health. The paid_budget__monthly_performance and paid_budget__weekly_performance models compute CAC and ROI but do not measure attribution system reliability itself (e.g., what percentage of conversions have a known touchpoint). stg_posthog__events_filtered includes bot detection which is a data quality filter, but there are no dedicated models that track the quality and completeness of the attribution pipeline over time. The admin utility models (active_queries, columns, source_columns) are Snowflake infrastructure monitors, not attribution health monitors. daily_marketing_acquisition provides a daily snapshot of paid channel signups but does not measure coverage gaps.",
      "notes": "This is the most significant gap in the attribution system. There are no models that answer questions like: 'What percentage of conversions have at least one attributed touchpoint?', 'Is the direct-only rate increasing over time?', or 'Has the gclid match rate degraded this week?' The paid budget performance models (monthly/weekly) measure marketing outcomes (CAC, ROI) but not attribution system health. Without coverage and health monitoring, the team cannot detect when tracking breaks, click ID capture rates degrade, or identity resolution quality shifts \u2014 all of which silently corrupt attribution results."
    }
  ]
}