Data Transfer from Amazon S3 to BigQuery

Tadashi Shigeoka ·  Sat, June 12, 2021

In this post, I’ll walk through how to transfer data from Amazon S3 to BigQuery.

Background: Mainly AWS but Want to Use BigQuery Only for DWH

The background is that my infrastructure runs mainly on AWS, but I want to use BigQuery solely as the data warehouse (DWH). Below, I’ll share the information that was helpful when implementing data transfer from Amazon S3 to BigQuery.

Amazon S3 → BigQuery Data Transfer Reference Articles

First, if you want to get the end-to-end flow working, transferring CSV files from S3 and loading them into BigQuery tables, you can reach a working state quickly by referring to the article BigQuery Data Transfer Service で AWS S3 から BigQuery に CSV データをロードしてみた | DevelopersIO (in Japanese; roughly, “I tried loading CSV data from AWS S3 into BigQuery with BigQuery Data Transfer Service”).

It’s also worth reading the official documentation, Amazon S3 転送 (Amazon S3 transfers) | BigQuery Data Transfer Service | Google Cloud.
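As a concrete reference, here is a minimal sketch of creating an Amazon S3 transfer configuration with the google-cloud-bigquery-datatransfer Python client. The project, dataset, table, bucket, and credential values are all placeholders, and the parameter names follow the Amazon S3 transfer documentation:

from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# All identifiers and credentials below are placeholders.
transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="your_dataset",
    display_name="s3-to-bigquery",
    data_source_id="amazon_s3",
    params={
        "data_path": "s3://your_bucket/your_folder/*.avro",
        "destination_table_name_template": "your_table",
        "access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
        "secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
        "file_format": "AVRO",
    },
    schedule="every 24 hours",  # or trigger runs on demand (see below)
)

transfer_config = client.create_transfer_config(
    parent=client.common_project_path("your_gcp_project"),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")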

Amazon S3 → BigQuery Data Transfer Knowledge

"On-demand" Cannot Guarantee Real-time Processing

If you set the Amazon S3 transfer schedule to “On-demand”, the execution timing is uncertain, so you should assume that real-time processing cannot be guaranteed.
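An “On-demand” configuration only runs when you trigger it, for example with start_manual_transfer_runs in the Python client. Here is a minimal sketch, assuming an existing transfer config (the resource name is a placeholder); note that even a manual request only queues a run, and the actual start time is still up to the service:

import datetime

from google.cloud import bigquery_datatransfer
from google.protobuf import timestamp_pb2

client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder resource name of an existing transfer config.
config_name = "projects/123456789/locations/us/transferConfigs/abcdef"

# Ask for a run "now"; the service decides when it actually starts.
run_time = timestamp_pb2.Timestamp()
run_time.FromDatetime(datetime.datetime.now(datetime.timezone.utc))

response = client.start_manual_transfer_runs(
    request=bigquery_datatransfer.StartManualTransferRunsRequest(
        parent=config_name,
        requested_run_time=run_time,
    )
)
for run in response.runs:
    print(run.name, run.state)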

A Trailing Wildcard * Alone Won’t Load Files, So Specify the Extension

From the official documentation, under “Effects of prefix matching or wildcard matching”:

The Amazon S3 API supports prefix matching but not wildcard matching. All Amazon S3 files that match the prefix are transferred to Google Cloud. However, only files that match the Amazon S3 URI in the transfer configuration are actually loaded into BigQuery.

s3://your_bucket/your_folder/*

Files in S3 will not be loaded with just a trailing wildcard *, so you need to explicitly specify the extension (here .avro) in addition to the wildcard, like:

s3://your_bucket/your_folder/*.avro
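Expressed as the transfer config parameter (a sketch; the bucket and folder names are placeholders), the data_path value needs the extension as well:

# Loads nothing into BigQuery: trailing wildcard only.
params = {"data_path": "s3://your_bucket/your_folder/*"}

# Works: wildcard plus an explicit extension.
params = {"data_path": "s3://your_bucket/your_folder/*.avro"}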

[File Format] Avro Recommended

For [File format], select the data format. You can choose from JSON (newline delimited), CSV, Avro, Parquet, or ORC. I recommend Avro: it is a self-describing format (the schema travels with the data), so BigQuery can pick up the table schema without a separate definition.

Check DATE, TIMESTAMP, DATETIME Formats

Time-unit column partitioning: you can partition a table on a DATE, TIMESTAMP, or DATETIME column in the table. When data is written to the table, BigQuery automatically puts it into the correct partition based on the values in that column.

Source: 時間単位列パーティショニング - パーティション分割テーブルの概要 | BigQuery | Google Cloud (Time-unit column partitioning - Introduction to partitioned tables)

  • The partitioning column (DATE, TIMESTAMP, or DATETIME) must actually exist in the source data
  • The column values must be in a format that BigQuery accepts as DATE, TIMESTAMP, or DATETIME
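As a sketch of the destination side (the project, dataset, table, and column names are placeholders), a time-unit column-partitioned table can be created with the google-cloud-bigquery client like this:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table ID and schema; the partitioning column must be
# DATE, TIMESTAMP, or DATETIME and must also exist in the source data.
table = bigquery.Table(
    "your_gcp_project.your_dataset.your_table",
    schema=[
        bigquery.SchemaField("id", "INTEGER"),
        bigquery.SchemaField("event_date", "DATE"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="event_date",  # rows are routed to partitions by this column
)
table = client.create_table(table)
print(f"Created partitioned table {table.full_table_id}")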

That’s all from the field on transferring data from Amazon S3 to BigQuery.