In this post, I’ll cover how to transfer data from Amazon S3 to BigQuery.
The background: I mainly use AWS but want to use BigQuery only as a DWH, so I’m sharing the information that helped me implement the data transfer from Amazon S3 to BigQuery.
First, if you want to get the basic flow working (transferring CSV files from S3 and loading them into BigQuery tables), the article BigQuery Data Transfer Service で AWS S3 から BigQuery に CSV データをロードしてみた | DevelopersIO will get you to a working implementation quickly.
It’s also worth reading the official documentation: Amazon S3 転送 | BigQuery Data Transfer Service | Google Cloud.
Amazon S3 Data Transfer Setup
With an “On-demand” schedule, the execution timing is not guaranteed, so it’s best to assume the transfer cannot be treated as real-time processing.
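As a rough illustration of the on-demand behavior, below is a minimal sketch using the google-cloud-bigquery-datatransfer Python client to trigger a run manually; the transfer config name is a hypothetical placeholder, and the returned runs only start in a queued (PENDING) state, so when they actually execute is up to the service.

import time
from google.protobuf import timestamp_pb2
from google.cloud import bigquery_datatransfer

client = bigquery_datatransfer.DataTransferServiceClient()

# Hypothetical resource name of an existing S3 transfer config
transfer_config_name = "projects/your-project-id/locations/us/transferConfigs/your-config-id"

# Request a run "now"; the service queues it rather than executing immediately
response = client.start_manual_transfer_runs(
    request=bigquery_datatransfer.StartManualTransferRunsRequest(
        parent=transfer_config_name,
        requested_run_time=timestamp_pb2.Timestamp(seconds=int(time.time())),
    )
)

for run in response.runs:
    # Runs begin as PENDING; poll them or use Pub/Sub notifications to track completion
    print(run.name, run.state)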
Effects of prefix matching or wildcard matching
The Amazon S3 API supports prefix matching but not wildcard matching. All Amazon S3 files that match the prefix are transferred to Google Cloud. However, only files that match the Amazon S3 URI in the transfer configuration are actually loaded into BigQuery.
For example, with just a wildcard in the URI like this, the files in S3 were not transferred:
s3://your_bucket/your_folder/*
You need to explicitly specify the file extension in addition to the wildcard, in this case .avro:
s3://your_bucket/your_folder/*.avro
For [File format], select the data format. You can choose from JSON (newline delimited), CSV, Avro, Parquet, or ORC.
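Putting the data path and file format settings together, here is a minimal sketch of creating the transfer configuration with the google-cloud-bigquery-datatransfer Python client, modeled on Google’s published samples; the project, dataset, table, bucket, and AWS credential values are placeholders, and the param keys are worth double-checking against the current documentation.

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

# Placeholder identifiers
project_id = "your-project-id"
dataset_id = "your_dataset"

transfer_config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id=dataset_id,
    display_name="s3-to-bigquery-avro",
    data_source_id="amazon_s3",
    params={
        # Wildcard plus explicit extension, as discussed above
        "data_path": "s3://your_bucket/your_folder/*.avro",
        "destination_table_name_template": "your_table",
        "access_key_id": "YOUR_AWS_ACCESS_KEY_ID",
        "secret_access_key": "YOUR_AWS_SECRET_ACCESS_KEY",
        "file_format": "AVRO",
    },
    schedule="every 24 hours",
)

transfer_config = transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(project_id),
    transfer_config=transfer_config,
)
print(f"Created transfer config: {transfer_config.name}")

For an on-demand setup, as far as I know you can omit the schedule and disable automatic scheduling through the config’s schedule_options, then trigger runs manually as in the earlier sketch.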
Time-unit column partitioning
Tables can be partitioned by a DATE, TIMESTAMP, or DATETIME column in the table. When you write data to the table, BigQuery automatically puts the data into the correct partition based on the values in the column.
Source: 時間単位列パーティショニング - パーティション分割テーブルの概要 | BigQuery | Google Cloud
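To make this concrete, here is a minimal sketch using the google-cloud-bigquery Python client to create a time-unit column-partitioned destination table, assuming a hypothetical schema where created_at is the TIMESTAMP partitioning column:

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical destination table with a TIMESTAMP column to partition on
table = bigquery.Table(
    "your-project-id.your_dataset.your_table",
    schema=[
        bigquery.SchemaField("id", "STRING"),
        bigquery.SchemaField("created_at", "TIMESTAMP"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,  # daily partitions
    field="created_at",  # rows are routed to partitions by this column's value
)
table = client.create_table(table)
print(f"Created partitioned table {table.full_table_id}")

With a table like this in place, rows loaded by the transfer land in the partition matching their created_at value automatically.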
That’s all from the Gemba about transferring data from Amazon S3 to BigQuery.