BigQuery Data Engineering

Using LEFT JOIN with UNNEST in BigQuery: A Comprehensive Guide

When working with BigQuery, it’s common to encounter scenarios where you need to join datasets. A unique feature of BigQuery is its ability to handle nested and repeated fields using the UNNEST function. In this blog, we’ll walk through how to use a LEFT JOIN with an unnested attribute, explaining each term and concept in detail.


The Problem

Consider the following scenario: you have two tables:

orders Table:

order_idcustomer_idproducts
1101[“apple”, “banana”]
2102[“orange”, “grape”]
3103NULL

product_details Table:

product_nameprice
apple1.5
banana1.0
orange2.0
grape3.0

Our goal is to join these tables such that we get a detailed list of all products in each order, along with their respective prices. If an order has no products, it should still appear in the result.


The Query

Here’s how we can achieve this:

SELECT
  o.order_id,
  o.customer_id,
  p.product_name,
  pd.price
FROM
  `project.dataset.orders` o
LEFT JOIN
  UNNEST(o.products) AS p
LEFT JOIN
  `project.dataset.product_details` pd
ON
  p = pd.product_name;

Understanding the Terminology

  1. LEFT JOIN:
  • A type of SQL join that combines rows from two tables.
  • Ensures that all rows from the left table (orders) appear in the result, even if there’s no matching row in the right table (product_details).
  1. UNNEST:
  • A BigQuery-specific function that expands an array into a set of rows.
  • In our example, the products column in the orders table is an array (e.g., ["apple", "banana"]).
  • UNNEST(o.products) creates one row for each product in the array.
  1. Alias (AS p):
  • Assigns a short name (alias) to the result of the UNNEST operation.
  • Here, p represents individual product names from the products array.
  1. ON Clause:
  • Specifies the condition for joining tables.
  • In this case, p = pd.product_name ensures that each product name from the UNNEST operation is matched with its corresponding entry in the product_details table.

The Result

The query produces the following output:

order_idcustomer_idproduct_nameprice
1101apple1.5
1101banana1.0
2102orange2.0
2102grape3.0
3103NULLNULL

Explanation of the Output

  • For order_id = 1, the products array contains apple and banana, so we see one row per product with the corresponding prices.
  • For order_id = 2, the products array contains orange and grape.
  • For order_id = 3, the products array is NULL, so the UNNEST operation produces no rows, but the LEFT JOIN ensures that the row from the orders table still appears with NULL values for product_name and price.

Why Use LEFT JOIN with UNNEST?

  1. Handle Nested Data: BigQuery’s UNNEST function allows you to work seamlessly with nested and repeated fields, which are common in JSON or NoSQL datasets.
  2. Preserve Left Table Rows: The LEFT JOIN ensures no rows are lost from the primary (left) table, even if there’s no matching data in the joined table.
  3. Scalable Operations: BigQuery’s architecture is optimized for such operations, making it efficient for large datasets.

Key Takeaways

  • BigQuery’s UNNEST function is essential for working with arrays.
  • LEFT JOIN ensures data integrity by including all rows from the left table, even if there are no matches.
  • Combining LEFT JOIN with UNNEST allows you to flatten nested data while enriching it with additional details from other tables.

This combination is a powerful tool in BigQuery for working with structured and semi-structured data. By mastering these concepts, you can unlock the full potential of BigQuery for complex data transformations.

Avatar

Neelabh

About Author

As Neelabh Singh, I am a Senior Software Engineer with 6.6 years of experience, specializing in Java technologies, Microservices, AWS, Algorithms, and Data Structures. I am also a technology blogger and an active participant in several online coding communities.

You may also like

Data Engineering

Understanding PCollection in Apache Beam: The Backbone of Scalable Data Pipelines

In the world of distributed data processing, Apache Beam has emerged as a robust framework for building scalable, unified data