dbt_attribution_gap_analysis v1

dbt_attribution_gap_analysis — v1

ID	5af54362-8efc-45cb-83a5-1218f5f748ea
Lens	dbt_attribution_gap_analysis
Function	assess
Phase	diagnosis
Topics	attribution, dbt
Version	v1
Created	2026-04-19T10:25:06+00:00
Content

{
  "gaps": [
    {
      "name": "referrer_classification",
      "layer": "channel_classification",
      "description": "No model classifies referrer domains into marketing channels. All traffic without a click ID or UTM parameter falls into a single undifferentiated 'Organic/Direct' bucket, preventing distinction between organic search, organic social, referral traffic, and true direct visits.",
      "impact_description": "Without referrer classification, the project cannot distinguish between a user arriving from a Google organic search result, a Reddit post linking to the product, a Hacker News referral, or a direct bookmark visit. All of these are collapsed into 'Organic/Direct', which inflates the direct channel and hides the true contribution of organic and referral traffic. This makes it impossible to evaluate whether SEO or community efforts are driving signups, and overstates the apparent reliance on paid channels in CAC calculations.",
      "recommendation": "Build an int_channel__referrer_classification model that maps $initial_referring_domain (already captured in stg_posthog__person_identity) to a channel using a CASE WHEN hierarchy: known search engine domains \u2192 'Organic Search', known social domains \u2192 'Organic Social', non-empty referrer not matching the product domain \u2192 'Referral', empty/null/self-referral \u2192 'Direct'. Maintain a seed file (seeds/referrer_domain_mapping.csv) with domain-to-channel mappings for extensibility. Integrate this into canonical_users' channel classification by extending the existing CASE WHEN hierarchy: click IDs > UTM > referrer classification > Direct.",
      "effort": "medium",
      "impact": "high",
      "priority": "high",
      "prerequisites": []
    },
    {
      "name": "attribution_coverage_and_health",
      "layer": "coverage_health",
      "description": "No models exist to measure the reliability and completeness of the attribution system itself. There is no tracking of behavioural coverage rate (% of conversions with an attributed touchpoint), direct-only rate, click ID capture rates by platform, or identity resolution match rates over time.",
      "impact_description": "Without coverage and health monitoring, the team cannot detect when tracking degrades \u2014 for example, if a website deploy breaks gclid capture, if PostHog's identity stitching quality shifts, or if UTM parameters are stripped by a new browser policy. These failures silently corrupt attribution results: CAC appears to improve (fewer attributed conversions in the denominator) while actual performance is unchanged. The team would only discover the problem weeks later through ad hoc investigation, by which time budget allocation decisions have been made on corrupted data.",
      "recommendation": "Build a mart_attribution__coverage_daily model that calculates daily metrics: (1) total conversions (from paid_budget__conversions or conversion_events), (2) conversions with at least one click ID or UTM, (3) conversions with only 'Organic/Direct' attribution, (4) click ID capture rate per platform (COUNT(gclid) / COUNT(*) for Google-sourced users, etc.), (5) identity resolution tier distribution from int_identity__resolution (count per match_tier per day). Use SUM(CASE WHEN) for coverage flags and GROUP BY date for the daily grain. Optionally add a mart_attribution__coverage_weekly model that computes 7-day rolling averages and week-over-week deltas to detect drift. For XmR process charts, calculate the moving range (absolute difference between consecutive weekly values) and derive upper/lower natural process limits (mean \u00b1 2.66 \u00d7 mean moving range) to flag statistically significant shifts.",
      "effort": "medium",
      "impact": "high",
      "priority": "high",
      "prerequisites": []
    },
    {
      "name": "last_touch_attribution_model",
      "layer": "attribution_models",
      "description": "The project implements only first-touch attribution. There is no last-touch model that credits the final touchpoint before conversion, despite int_user_id__attribution_history storing the multi-touch data needed to build one.",
      "impact_description": "First-touch attribution alone overvalues awareness channels (top-of-funnel) and undervalues channels that drive the final conversion decision. For a product with a multi-day consideration cycle, a user might discover the product through a Google ad (first touch) but convert after clicking a Reddit retargeting ad (last touch). Without a last-touch model, the Reddit campaign appears to generate zero conversions even though it is closing deals. This leads to misallocation of paid budget toward awareness channels at the expense of conversion channels.",
      "recommendation": "Build an int_attribution__last_touch model that reads int_user_id__attribution_history (already stored in long format with person_version for ordering) and selects the most recent touchpoint per user using ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY posthog_created_at DESC) = 1. Apply the same click ID > UTM > referrer classification hierarchy used in canonical_users to derive last_touch_channel. Then build a paid_budget__conversions_last_touch model that mirrors paid_budget__user_conversions but joins to the last-touch channel instead of first-touch. This enables side-by-side comparison of first-touch vs last-touch attribution in the weekly/monthly performance reports.",
      "effort": "medium",
      "impact": "high",
      "priority": "high",
      "prerequisites": []
    },
    {
      "name": "session_detection",
      "layer": "signal_touchpoint",
      "description": "The project relies entirely on PostHog's native session handling (posthog_sessions is a passthrough staging model) rather than implementing its own session detection logic in dbt. This means session boundaries are opaque, not customizable, and cannot be extended with business-specific rules.",
      "impact_description": "Without dbt-native session detection, the project cannot customize session timeout windows (PostHog defaults to 30 minutes but the product may warrant different thresholds for different surfaces), cannot sessionize gateway/backend events that occur outside PostHog, and cannot build session-level attribution where a session's entry channel is determined by the first event's referrer/UTM. This limits touchpoint deduplication \u2014 a user who visits 5 pages from the same email click generates 5 potential touchpoints rather than being grouped into a single session-level touchpoint. It also prevents session-grain attribution models that would be more accurate for multi-touch analysis.",
      "recommendation": "Build an int_sessions__detected model that reads stg_posthog__events_filtered, uses LAG(event_timestamp) OVER (PARTITION BY distinct_id ORDER BY event_timestamp) to detect inactivity gaps exceeding 30 minutes, assigns a session_id via a running SUM of the gap flag, and enriches each session with entry-level attribution (first event's UTM, referrer, click ID using FIRST_VALUE). This creates a session-grain touchpoint table that can serve as the foundation for session-level attribution. Downstream, int_user_id__attribution_history could be extended to reference session-level touchpoints rather than person property versions.",
      "effort": "medium",
      "impact": "medium",
      "priority": "medium",
      "prerequisites": []
    },
    {
      "name": "multi_touch_attribution_models",
      "layer": "attribution_models",
      "description": "No multi-touch attribution models (linear, time-decay, or position-based) are implemented. int_user_id__attribution_history stores touchpoint data in long format but no downstream model distributes fractional credit across touchpoints.",
      "impact_description": "Without multi-touch models, the project can only answer 'which channel gets all the credit' (first-touch or last-touch) but not 'how much did each channel contribute to this conversion.' For users with 3+ touchpoints across different channels, both first-touch and last-touch models will always ignore the middle touchpoints entirely. This creates blind spots in the marketing mix where channels that play an assist role (e.g., retargeting, email nurture) appear to have zero impact. Multi-touch models are particularly valuable for understanding how paid and organic channels interact in the conversion journey.",
      "recommendation": "Build an int_attribution__multi_touch model that reads int_user_id__attribution_history (or the new session-level touchpoints if session_detection is built first), counts touchpoints per user, and calculates fractional credit under three models: (1) linear: 1/n credit per touchpoint, (2) time-decay: exponential decay from conversion backward using EXP(-lambda * hours_before_conversion), normalized to sum to 1, (3) position-based: 40% first, 40% last, 20% split among middle touches using CASE WHEN with ROW_NUMBER and COUNT. Output at (user_id, touchpoint_id, model_type, credit) grain. Build a mart_attribution__model_comparison that aggregates credit by channel \u00d7 model_type for side-by-side comparison in BI tools.",
      "effort": "high",
      "impact": "medium",
      "priority": "medium",
      "prerequisites": [
        "last_touch_attribution_model"
      ]
    },
    {
      "name": "model_comparison_capability",
      "layer": "attribution_models",
      "description": "No model exists that compares the outputs of different attribution models side by side. The project cannot currently answer 'how does channel X's credited conversions differ between first-touch and last-touch?'",
      "impact_description": "Without model comparison, the team cannot evaluate whether their chosen attribution model (first-touch) is producing significantly different results than alternatives. If first-touch and last-touch produce nearly identical results for most channels, the team can be confident in their choice. If they diverge significantly for key channels, it signals that the attribution model choice materially impacts budget decisions and warrants deeper investigation. Model comparison is also the foundation for building trust with stakeholders \u2014 showing multiple perspectives reduces the appearance of cherry-picking a favorable model.",
      "recommendation": "Build a mart_attribution__model_comparison model that joins first-touch channel attribution (from canonical_users / paid_budget__conversions) with last-touch attribution (from the proposed int_attribution__last_touch) and optionally multi-touch credit (from int_attribution__multi_touch). Output at (date, channel, product) grain with columns for first_touch_conversions, last_touch_conversions, linear_credit, time_decay_credit, position_based_credit. Add a divergence_score column calculated as MAX(model_value) - MIN(model_value) / AVG(model_value) per channel-date to highlight channels where model choice matters most.",
      "effort": "low",
      "impact": "medium",
      "priority": "medium",
      "prerequisites": [
        "last_touch_attribution_model"
      ]
    },
    {
      "name": "survey_integration",
      "layer": "attribution_models",
      "description": "stg_backend__kilocode_users contains a customer_source field with regex-based classification of self-reported attribution (e.g., 'How did you hear about us?'), but this data is not integrated into the attribution models or compared against behavioral attribution signals.",
      "impact_description": "Self-reported attribution captures channels that are invisible to behavioral tracking \u2014 word of mouth, podcast mentions, conference talks, YouTube reviews, and other 'dark social' sources. Without integrating this signal, the attribution system systematically undervalues these channels and over-attributes to the last trackable digital touchpoint. For a developer tool where community and word-of-mouth are likely significant acquisition drivers, this blind spot could lead to under-investment in community and content marketing.",
      "recommendation": "Build an int_attribution__survey_responses model that reads stg_backend__kilocode_users.customer_source (already regex-classified), maps the classified values to a standardized channel taxonomy matching the behavioral attribution channels, and joins to canonical_users. Build a mart_attribution__survey_vs_behavioral model that compares self-reported channel to the first-touch behavioral channel at (signup_week, self_reported_channel, behavioral_channel) grain using COUNT for each combination. This produces a confusion-matrix-style output showing where behavioral and self-reported attribution agree and disagree, revealing channels that behavioral tracking systematically misses.",
      "effort": "low",
      "impact": "medium",
      "priority": "medium",
      "prerequisites": []
    },
    {
      "name": "device_graph",
      "layer": "identity_resolution",
      "description": "The identity resolution system maps PostHog persons to canonical users but does not link multiple physical devices to a single user via fingerprinting or device-level signals. stg_backend__stytch_fingerprints exists in staging but is not consumed by the identity resolution chain.",
      "impact_description": "Without a device graph, a user who discovers the product on their phone (clicking a social ad) and later signs up on their laptop is treated as two separate journeys. The mobile ad touchpoint is lost because the phone's anonymous distinct_id is never linked to the laptop's authenticated user. This systematically undervalues mobile-first channels (social ads, display) and overvalues desktop-origin channels. For a developer tool where mobile discovery \u2192 desktop conversion is a common pattern, this gap could meaningfully distort channel attribution.",
      "recommendation": "Build an int_identity__device_graph model that reads stg_backend__stytch_fingerprints and links fingerprint_id to user_id (for authenticated sessions) and distinct_id (for anonymous sessions). Use the fingerprint as a bridge: if fingerprint F is seen with anonymous distinct_id D1 on mobile and later with authenticated user_id U1 on desktop, map D1 \u2192 U1. Implement as a supplementary identity resolution tier in int_identity__resolution (e.g., tier between email_only and unresolved_posthog). Use ROW_NUMBER() OVER (PARTITION BY fingerprint_id ORDER BY created_at) to handle fingerprint collisions (multiple users sharing a fingerprint on shared devices), only linking when a fingerprint maps to exactly one authenticated user.",
      "effort": "high",
      "impact": "medium",
      "priority": "low",
      "prerequisites": []
    },
    {
      "name": "user_agent_heuristics",
      "layer": "channel_classification",
      "description": "stg_backend__http_user_agent exists in staging but is not used for channel classification or traffic quality filtering. User-agent data could support device/browser-based classification and improve bot detection.",
      "impact_description": "Without user-agent heuristics, the project cannot distinguish between in-app browser traffic (which may indicate social platform referrals even when referrer is stripped), headless browser bot traffic that passes basic bot filters, or platform-specific browser signatures that could improve attribution when UTMs and referrers are missing. This is a lower-impact gap because the primary classification hierarchy (click IDs > UTM > referrer) handles most cases, but user-agent data could improve edge case classification and data quality.",
      "recommendation": "Build an int_channel__user_agent_enrichment model that parses stg_backend__http_user_agent using CASE WHEN LIKE patterns to detect: (1) in-app browsers (FBAN/FBAV for Facebook, Instagram, LinkedIn), (2) known bot user-agent strings not caught by the existing PostHog bot filter, (3) device type (mobile vs desktop vs tablet). Join this to the session or event grain to add device_type and is_in_app_browser flags. Use is_in_app_browser as a supplementary signal in referrer classification \u2014 if the user-agent indicates a Facebook in-app browser but no fbclid is present, classify as 'Organic Social (Meta)' rather than 'Direct'.",
      "effort": "medium",
      "impact": "low",
      "priority": "low",
      "prerequisites": [
        "referrer_classification"
      ]
    }
  ]
}