Mastering the BigQuery GA4 Daily Export Schema in the Consent Mode Era
By Zac Nash|8 Oct 2024
Welcome to the world of the BigQuery GA4 daily export schema! If you're scratching your head wondering what all those columns and tables mean, don't worry—you're not alone. In this post, I'll break down the BigQuery GA4 daily export schema in a way that even Sherlock Holmes would appreciate. Grab your magnifying glass, and let's get started!
This blog covers the following topics:
- What is the BigQuery GA4 Daily Export Schema?
- The Big Picture: Understanding the Daily Export Table
- Key Components of the Schema
- E-commerce and Items: A Key Feature
- Practical Examples and Use Cases
- Primary Differences Between GA4 UI and BigQuery Daily Export
- Accessing Traffic Source and Medium in GA4 BigQuery
- Mapping to Default Channel Grouping
- Be Careful: Common Pitfalls and Considerations
- Differences from GA4 UI
- Understanding Consent Mode
- Effects on BigQuery Daily Export
- Best Practices and Considerations
- Conclusion
What is the BigQuery GA4 Daily Export Schema?
Before we dive into the nitty-gritty, let's get a bird's eye view.
- Definition: The BigQuery GA4 daily export schema is a structured format for the data exported from Google Analytics 4 (GA4) to BigQuery.
- Purpose: It allows you to perform in-depth analysis and build custom reports that go beyond the standard GA4 capabilities.
- Components: The schema includes various tables and fields that capture user interactions, events, and more.
The Big Picture: Understanding the Daily Export Table
What Each Row Represents
- Each row in the daily export table represents a single event (that occurred) on your website or app. This is a fundamental shift from Universal Analytics' session-based data model to GA4's event-based model.
- Event-centric: Each row is an event (e.g., page view, click, purchase).
- Granular Data: Every user interaction is captured as an event, providing more detailed insights.
Stuck on analytics? We can help!
Key Components of the Schema
1. The Event Table
The heart of the GA4 export schema is the event table. This is where the magic happens.
- event_name: The name of the event (e.g., 'page_view', 'click').
- event_timestamp: When the event occurred (in microseconds).
- user_id: Unique identifier for the user (if set).
2. User Properties
These fields provide insights into who your users are.
- user_pseudo_id: Anonymised ID to track users across sessions.
- user_properties: Custom properties you've set for users (e.g., 'preferred_language')
3. Event Parameters
Event parameters add context to your events.
- param_name: The parameter's name (e.g., 'button_text').
- param_value: The value of the parameter.
E-commerce and Items: A Key Feature
One of the GA4 export schema's most valuable aspects for e-commerce businesses is the detailed tracking of e-commerce events and item data.
E-commerce Fields
These fields provide comprehensive information about the various e-commerce events.
- ecommerce.total_item_quantity: Total number of items included in the event.
- ecommerce.purchase_revenue: Revenue generated from the purchase event.
- ecommerce.tax_value: Tax value for the event.
- ecommerce.shipping_value: Shipping cost for the event.
- ecommerce.refund_value: Refund value for the event.
- ecommerce.transaction_id: Unique identifier for the transaction.
- ecommerce.unique_items: Number of unique items in the event.
Item-level Data
Each e-commerce event can also include detailed item-level data, providing granular insights into product performance.
- items.item_id: Unique identifier for the item.
- items.item_name: Name of the item.
- items.item_brand: Brand of the item.
- items.item_category: Category of the item.
- items.item_variant: Variant of the item.
- items.item_price: Price of the item.
- items.item_quantity: Quantity of the item.
Breaking Down a Typical E-commerce Event
Let's take a closer look at a purchase event to see how these components come together.
- event_name: 'purchase'
- event_timestamp: 1622547800000000
- user_pseudo_id: 'ABCDEFG12345'
- user_properties: { 'preferred_language': 'en' }
- ecommerce: {
- total_item_quantity: 3,
- purchase_revenue: 120.50,
- tax_value: 10.50,
- shipping_value: 5.00,
- transaction_id: 'TX123456',
- unique_items: 2 }
- items: [
{ 'item_id': 'SKU123', 'item_name': 'Running Shoes', 'item_brand': 'BrandX', 'item_category': 'Sportswear', 'item_variant': 'Red', 'item_price': 60.25, 'item_quantity': 1 }, { 'item_id': 'SKU124', 'item_name': 'Water Bottle', 'item_brand': 'Brandy', 'item_category': 'Accessories', 'item_variant': 'Blue', 'item_price': 30.00, 'item_quantity': 2 }
]
Interestingly, in the BigQuery GA4 daily export schema, a single purchase event encapsulates multiple levels of data—each providing different insights.
- Event Level: Captures the overall transaction with fields like event_name and event_timestamp. There's one event for the entire purchase.
- E-commerce Level: Provides aggregated details about the transaction such as total_item_quantity (total items bought, here 3), purchase_revenue (total revenue, here 120.50), tax_value, shipping_value, and unique_items (distinct products, here 2).
- Item Level: Lists detailed information for each product in the items array. For instance, one Running Shoes with item_quantity of one and two Water Bottles with item_quantity of two. Each item entry includes specifics like item_id, item_name, item_brand, item_category, item_variant, item_price, and item_quantity. This shows a breakdown of how we get to the three total items bought, and the two unique items, at the e-commerce level.
Practical Examples and Use Cases
1. Analysing User Behaviour
By examining the event table, you can identify which pages are most popular, where users drop off, and what actions they take.
Example: Use SQL queries to find the most viewed pages.
SELECT event_params.value.string_value AS page_location, COUNT(*) as page_views
FROM `your_project.your_dataset.events_*`,
UNNEST(event_params) AS event_params
WHERE event_name = 'page_view'
AND event_params.key = 'page_location'
GROUP BY page_location
ORDER BY page_views DESC;
2. E-commerce Performance
Track the performance of your products and transactions using e-commerce-specific fields.
Example: Query to extract e-commerce transaction data.
SELECT
ecommerce.total_item_quantity,
ecommerce.purchase_revenue,
ecommerce.tax_value,
ecommerce.shipping_value,
ecommerce.transaction_id,
ecommerce.unique_items
FROM `your_project.your_dataset.events_*`
WHERE event_name = 'purchase'
LIMIT 100;
3. Item Performance
Dive deeper into item-level data to understand which products are driving sales.
Example: Query to extract item performance data.
SELECT
items.item_id,
items.item_name,
items.item_brand,
items.item_category,
items.item_variant,
items.item_price,
items.item_quantity
FROM `your_project.your_dataset.events_*`,
UNNEST(items) AS items
WHERE event_name = 'purchase'
LIMIT 100;
Primary Differences Between GA4 UI and BigQuery Daily Export
1. Data Processing and Aggregation
- GA4 UI: The data shown in the GA4 interface often includes aggregated and sampled data to deliver quick insights. Metrics like user counts and session durations might be approximated using algorithms, which can lead to small discrepancies (1-4%) compared to raw data. In the consent mode era, the data can be completely different depending on the reporting identity.
- BigQuery Export: This provides raw, unaggregated event-level data, allowing for more precise and detailed analysis. However, replicating the aggregated metrics shown in the UI might require more complex querying.
2. Metric Definitions and Calculations
- User Metrics:
- GA4 UI: It uses a default of 'Active Users' to quantify users rather than 'Total Users'. Depending on the reporting identity, this may also include modelled Users. Again, more on this is in the consent mode section.
- BigQuery: Relies on user_pseudo_id and user_id (if set), giving a direct count of users based on events. Google Signals data and AI-driven predictions are not exported to BigQuery, which can lead to differences in user and session counts. Active users are also a possible metric you can extract from the BigQuery Export.
- Session Metrics:
- GA4 UI: Sessions are calculated using the session_start event and might be combined over multiple days if no session timeout occurs.
- BigQuery: It requires explicit querying to combine user_pseudo_id with ga_session_id to calculate session metrics accurately. Sessions spanning multiple days need to be handled differently compared to the UI.
3. Data Retention and Historical Data
- GA4 UI: Data retention policies may limit access to historical data (typically up to 14 months for non-standard reports)
- BigQuery: It retains data indefinitely from the point of activation, allowing long-term historical analysis. Activating BigQuery export early is crucial as it does not backfill historical data.
Accessing Traffic Source and Medium in GA4 BigQuery
In GA4's BigQuery export, traffic sources are represented primarily by the traffic_source table. Within this table, you can access vital fields such as:
- traffic_source.source (the source of the traffic, like Google or Bing)
- traffic_source.medium (the marketing medium, like "organic" or "CPC")
To map these fields into GA4's default channel groupings, you can query them directly from BigQuery using SQL. Here's a simplified query for obtaining the source and medium:
SELECT
traffic_source.source,
traffic_source.medium,
traffic_source.campaign
FROM
`your_project.your_dataset.ga4_export_*`
WHERE
_TABLE_SUFFIX BETWEEN '20240101' AND '20241231'
Recently, GA4 introduced a new dimension,
session_traffic_source_last_click, which captures the last click attribution for a session. This data is available within nested fields in BigQuery and can provide more accurate attribution than the standard traffic_source fields. To retrieve this data, you can unnest the session_traffic_source_last_click field:
SELECT
event_date,
session_traffic_source_last_click.manual_campaign.source AS session_source,
session_traffic_source_last_click.manual_campaign.medium AS session_mediumFROM
`your_project_id.your_dataset_id.ga4_table_name`
Data dilemma? Let’s solve it together!
Mapping to Default Channel Grouping
GA4's default channel groupings are rule-based classifications. These can be created in BigQuery by applying regex-based rules to the source and medium fields. For example:
SELECT
CASE
WHEN traffic_source.source = 'google' AND traffic_source.medium = 'organic' THEN
'Organic Search'
WHEN traffic_source.source = 'facebook' AND traffic_source.medium = 'social'
THEN 'Social'
ELSE 'Other'
END AS default_channel_grouping
FROM
`your_project.your_dataset.ga4_export_*`
Differences from GA4 UI
In the GA4 Analytics UI, the traffic source and medium fields are readily available through predefined reports (like the Acquisition report). However, the UI's default channel grouping is predefined and cannot be customised directly. In BigQuery, you have much more flexibility, allowing for custom channel groupings and deeper attribution models. Additionally, the session_traffic_source_last_click field is exclusive to BigQuery and provides more accurate last-click attribution, which differs from the UI's built-in model.
Using BigQuery allows you to create custom rules and tweak the attribution models, offering more control than the GA4 interface.
By integrating the traffic_source fields and the session_traffic_source_last_click, you can gain a more comprehensive understanding of your traffic channels and their performance.
Be Careful: Common Pitfalls and Considerations
1. Data Sampling and Estimates:
The GA4 UI often uses sampled data for faster reporting, which can lead to discrepancies. In contrast, BigQuery uses raw data without sampling, which is more accurate but may require more computational resources to process.
2. Event Count Discrepancies:
Differences in event counts can arise due to the exclusion of certain events or parameters in the GA4 UI, while BigQuery includes every raw event logged. Custom Parameters that GTM is collecting (or otherwise) will only show in the UI if a corresponding custom dimension has been set up. Custom Parameters will always be sent to BigQuery.
3. Complex Querying in BigQuery:
- To replicate some of the metrics from the UI in BigQuery, it involves handling nested and repeated fields and may not be straightforward.
- Nested Fields: Fields within fields (e.g., event_params and user_properties) require special handling to access their data.
- Repeated Fields: Arrays of values (e.g., items in a purchase event) need to be unnested to query effectively.
- Example: To extract parameters from nested fields, using the UNNEST function.
SELECT
event_name,
param.key AS param_key,
param.value.string_value AS param_value
FROM
`your_project.your_dataset.events_*`,
UNNEST(event_params) AS param
WHERE
event_name = 'purchase';
4. Cost Considerations:
BigQuery charges for data storage and query processing. While initial usage might be covered under free tiers, extensive querying and large datasets can incur significant costs. Google has information for best practices to help with BigQuery Costs.
Understanding Consent Mode
What is Consent Mode?
Consent Mode allows you to specify whether Google Analytics and other Google tags can collect data based on the user's consent status. It distinguishes between different types of consent, such as analytics cookies and ads cookies, to be set in the user's browser.
Basic vs. Advanced Consent Mode
Basic Consent Mode:
In this mode, if a user denies consent, Google tags do not fire, meaning no data is collected for that user. This mode is straightforward and compliant but can result in significant data loss—up to 30% or more—since unconsented users' data is completely excluded from both the GA4 UI and BigQuery exports.
Advanced Consent Mode:
This mode is more sophisticated and allows Google tags to fire without storing identifiable data, such as cookies. It tracks all events without identifying the user or session, which helps with compliance while minimising data loss. However, it still affects the granularity and completeness of user metrics because user_pseudo_id and session ID values are missing for unconsented events. This means that while all event data is available, the ability to group it by user or session is lost due to the lack of an ID to connect them. Within the UI, Google uses event data to model the number of unconsented users and sessions.
Effects on BigQuery Daily Export
1. Data Availability
- Without Consent (Basic Consent Mode): Events from users who do not grant consent are not tracked at all.
- Without Consent (Advanced Consent Mode): Events from users who do not grant consent are still tracked but lack identifiers like user_pseudo_id. This results in incomplete user and session scope data, making it challenging to perform user-level or session-level analysis accurately. For example, you will know that three form_submit events happened but you won’t know if they happened in the same session or by the same user.
- With Consent: When consent is granted, both before and after the consent, all events are recorded with identifiers, allowing for complete and accurate user tracking.
2. Behavioural and Conversion Modelling:
- When is modelling enabled? Modelling within the GA4 has certain requirements to activate. Firstly, Advanced Consent Mode needs to be implemented, Basic Consent Mode does not allow for Modelling. Then the reporting identity needs to be set to ‘blended’, as this allows for the combination of modelled and real data. Finally, there is a requirement for the property to reach a certain amount of usage, the property needs to collect at least 1,000 events per day with analytics_storage='denied' for at least 7 days. The property also has to have at least 1,000 daily users sending events with analytics_storage='granted' for at least 7 of the previous 28 days. Google says it may take more than 7 days of meeting the data threshold within those 28 days to train the model successfully; however, it's possible that even the additional data won't be sufficient for Analytics to train the model.
- Behavioural Modelling: GA4 UI uses machine learning to estimate user and session metrics when consent is not granted. This modelled data helps fill gaps in the GA4 UI but is not exported to BigQuery. Thus, BigQuery relies solely on raw, observed data, which may lead to discrepancies between GA4 UI reports and BigQuery exports.
- Conversion Modelling: Similar to behavioural modelling, it estimates conversions for unobservable data but does not apply to BigQuery exports, impacting the accuracy of conversion metrics.
3. User and Session Metrics:
- User Metrics: In BigQuery, the absence of user_pseudo_id for unconsented events leads to an underestimation of user counts. This requires careful handling when querying user data to avoid skewed results.
- Session Metrics: Session calculations can be affected as well. Since sessions are tied to user identifiers combined with session IDs, both of which are missing, which therefore can result in fragmented session data.
- Clarity and Education: When building out user and session metrics via BigQuery for reports, it is important to be clear of the above considerations. When grouping by user_pseudo_id or session IDs to build metrics, those metrics will only be Observed Users, or Consented Users, with the same classification for Sessions.
Best Practices and Considerations
- Implement Advanced Consent Mode: To mitigate data loss, use Advanced Consent Mode to track events without storing personal identifiers. This helps maintain data quality while complying with privacy regulations. Chat to Fresh Egg today if you need help with implementing Advanced Consent Mode yourself.
- Monitor and Compare Metrics: Regularly compare metrics between the GA4 UI and BigQuery to understand the impact of consent mode. Adjust your queries and data models accordingly to account for missing identifiers.
- Use Behavioural and Conversion Models: While these models do not directly impact BigQuery data, they provide valuable insights in the GA4 UI that can inform your analysis and help bridge data gaps.
- Legal and Compliance Review: Ensure your implementation of consent mode aligns with legal requirements and involves your legal team to avoid potential compliance issues.
By understanding and adapting to these nuances, you can effectively manage the impact of Consent Mode on your GA4 data in BigQuery, ensuring both compliance and data integrity.
Conclusion
The BigQuery GA4 daily export schema might seem like a daunting puzzle at first, but with some practice, you'll navigate it like a pro. By understanding the key components and leveraging practical examples, you can unlock the full potential of your GA4 data. So go ahead, put on your detective hat, and explore the rich insights hidden in your data.
If you need a Watson to help crack the case (or Eggs!), drop us a message; we will be happy to help.