FFS. Turns out (after I built a feature) that you can't supply a schema for BigQuery Materialised Views.

> Error: googleapi: Error 400: Schema field shouldn't be used as input with a materialized view, invalid

So it's impossible to have column descriptions for MVs? That sucks.

Whilst migrating our log pipeline to use the BigQuery Storage API, and thus end-to-end streaming of data from Storage (GCS) via Eventarc & Cloud Run (read, transform, enrich in NodeJS) to BigQuery, I tested some big files, many times the size of the largest we've ever seen in the wild.

It runs at just over 3 log lines/rows per millisecond (a little over 3,000/second) end-to-end (i.e. including writing to BigQuery) over 3.2M log lines.

Would be interested to know how that compares with similar systems.
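
For context, the read → transform → write path is roughly the shape below. A minimal sketch only, assuming newline-delimited log objects in GCS; the bucket/object names, `parseLogLine()` and the injected `writeBatch()` (standing in for the Storage Write API call) are placeholders, not our real code.

```js
const {Storage} = require('@google-cloud/storage');
const readline = require('node:readline');

// Read a newline-delimited log object from GCS line by line, transform each line
// into a row object, and flush rows to BigQuery in batches via writeBatch().
async function processLogObject(bucketName, objectName, writeBatch, batchSize = 500) {
  const input = new Storage().bucket(bucketName).file(objectName).createReadStream();
  const lines = readline.createInterface({input, crlfDelay: Infinity});

  let batch = [];
  for await (const line of lines) {
    if (!line) continue;             // skip blank lines
    batch.push(parseLogLine(line));  // transform/enrich one log line into a row
    if (batch.length >= batchSize) {
      await writeBatch(batch);       // flush a batch to BigQuery
      batch = [];
    }
  }
  if (batch.length) await writeBatch(batch); // flush the tail
}

// Hypothetical transform: split a tab-separated CDN access-log line into columns.
function parseLogLine(line) {
  const [timestamp, status, path] = line.split('\t');
  return {timestamp, status: Number(status), path};
}
```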

After several iterations, I think I've finally got my log ingest pipeline working properly, at scale, using the #BigQuery Storage API.
Some complications with migrating from the "legacy" "streaming" API (it's not streaming in the code sense) have been really hard to deal with, e.g.:
* A single failing row in a write means the entire write fails
* SQL column defaults don't apply unless you specifically configure them to
* 10MB/write limit
I rewrote the whole thing today & finally things are looking good! 🤞
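
For anyone attempting the same migration, the happy path I ended up with looks roughly like the sketch below, paraphrased from memory of the library's managed-writer samples. Treat the signatures (`getWriteStream()`, `createStreamConnection()`, the `adapt` helper) as assumptions to double-check against the docs for your version; the project/dataset/table IDs are placeholders.

```js
const {adapt, managedwriter} = require('@google-cloud/bigquery-storage');
const {WriterClient, JSONWriter} = managedwriter;

// Append plain JS row objects to a table via the default write stream.
async function writeRows(projectId, datasetId, tableId, rows) {
  const destinationTable = `projects/${projectId}/datasets/${datasetId}/tables/${tableId}`;
  const writeClient = new WriterClient({projectId});
  try {
    // Fetch the table's write-stream schema and build the proto descriptor the
    // JSON writer needs in order to serialise rows to protobuf.
    const writeStream = await writeClient.getWriteStream({
      streamId: `${destinationTable}/streams/_default`,
      view: 'FULL',
    });
    const protoDescriptor = adapt.convertStorageSchemaToProto2Descriptor(
      writeStream.tableSchema,
      'root'
    );

    // Open a connection to the default stream and append the rows in one call.
    const connection = await writeClient.createStreamConnection({
      streamId: managedwriter.DefaultStream,
      destinationTable,
    });
    const writer = new JSONWriter({connection, protoDescriptor});
    return await writer.appendRows(rows).getResult();
  } finally {
    writeClient.close();
  }
}
```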

...and after even more debugging, it turns out that the reason for the BigQuery `appendRows()` write failures was that large writes (~5k rows or ~5MB of data) take longer than the (undocumented) default `createStreamConnection()` timeout 🤦🏻‍♂️.

Even the units of the config option (once you find it) are not documented 🤦🏻‍♂️🤦🏻‍♂️. It's in milliseconds as it turns out.

I upped the timeout to 120s and the failures went away. FFS.

cloud.google.com/nodejs/docs/r

Class managedwriter.WriterClient (4.10.0) | Node.js client library | Google Cloud
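
For completeness, roughly what the change looks like. Treat it as hypothetical: a call-options `timeout` on `createStreamConnection()` is my guess at where the knob lives, not the documented signature, so check the `WriterClient` reference above. The only part I can vouch for is that the value is milliseconds.

```js
const {managedwriter} = require('@google-cloud/bigquery-storage');
const {WriterClient} = managedwriter;

// HYPOTHETICAL: the second, call-options argument and its `timeout` field are my
// guess at where the setting goes, not something confirmed in the reference docs.
async function openLongTimeoutConnection(destinationTable) {
  const writeClient = new WriterClient();
  return writeClient.createStreamConnection(
    {streamId: managedwriter.DefaultStream, destinationTable},
    {timeout: 120_000} // 120s, in milliseconds, so ~5k-row / ~5MB appends can finish
  );
}
```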

After a goodly amount of debugging, it turns out that the BigQuery Storage API (for NodeJS at least) function `appendRows()` (which is how you tell it to write data to BigQuery) fails every time if you give it "too much" data.

This is not documented (and the docs are somewhat minimal for such an important function). I have an open case with Google & have asked for the docs to be improved.

Thought it might help others to note this.

cloud.google.com/nodejs/docs/r

Class v1.BigQueryWriteClient (4.10.0) | Node.js client library | Google Cloud
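
Until the limits are documented, the workaround that held up for me was simply never handing `appendRows()` a huge batch in one go. A generic splitter, sketched below; nothing library-specific, and the caps are numbers that worked for our rows, not documented limits.

```js
// Keep each appendRows() call under a row-count cap and an approximate payload-size cap.
function* batchRows(rows, maxRows = 2000, maxBytes = 4 * 1024 * 1024) {
  let batch = [];
  let batchBytes = 0;
  for (const row of rows) {
    // JSON length is only a proxy for the serialised protobuf size, but it's cheap
    // and tends to overestimate, which errs on the safe side.
    const rowBytes = Buffer.byteLength(JSON.stringify(row));
    if (batch.length && (batch.length >= maxRows || batchBytes + rowBytes > maxBytes)) {
      yield batch;
      batch = [];
      batchBytes = 0;
    }
    batch.push(row);
    batchBytes += rowBytes;
  }
  if (batch.length) yield batch;
}

// Usage: for (const b of batchRows(rows)) await writer.appendRows(b).getResult();
```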

I deployed (then had to revert) an update to the log processing pipeline which ingests our CDN access logs in Google Cloud Run then writes to BigQuery.
The change migrated from the "legacy" BigQuery API to the Storage API. Thought it was worth sharing some write performance improvements seen in the Storage API:
* P99 0.5s (30%) lower
* p95 155ms (25%) lower
* p75 58ms (20%) lower
* p50 43ms (43%) *higher*
That's on 1/2 size Run containers so higher p50 is worth it (⬇️)
#BigQuery #GoogleCloud #BBC

The BigQuery Storage API has (AFAIK) undocumented behaviour around write failures which differs *dramatically* from the "legacy" API:

If you tell the BQ Storage API to `appendRows()` and *any* rows fail to be written (due to e.g. data type/range incompatibility), *all* rows will in fact not be written to BQ, even though the (Node) lib will only tell you about the rows which are incompatible.

It took me some time to find this out. Hoping it helps someone.
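
The pattern I settled on, sketched below: inspect the per-row errors the append reports, set those rows aside, and re-append the remainder. The `rowErrors[].index` shape is an assumption based on what I saw come back from the Node lib; depending on the version the failed append may reject rather than resolve, so adapt the error handling accordingly.

```js
// Sketch: drop the rows the response flags as invalid and retry the append with the rest.
async function appendDroppingInvalidRows(writer, rows) {
  const result = await writer.appendRows(rows).getResult();
  const rowErrors = result.rowErrors ?? []; // assumed shape: [{index, code, message}, ...]
  if (rowErrors.length === 0) return {written: rows.length, dropped: []};

  // Nothing was written at all: collect the flagged indexes, dead-letter the bad
  // rows somewhere, and re-append only the clean rows.
  const badIndexes = new Set(rowErrors.map(e => Number(e.index)));
  const goodRows = rows.filter((_, i) => !badIndexes.has(i));
  const dropped = rows.filter((_, i) => badIndexes.has(i));
  if (goodRows.length) await writer.appendRows(goodRows).getResult();
  return {written: goodRows.length, dropped};
}
```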

Long shot...
Is anyone using the Node JS SDK for the BigQuery Storage API with `appendRows()`?

It seems to fail *all* writes if >= 1 row passed to `appendRows()` fails, which makes it unusable for me.

The "legacy" API has a `skipInvalidRows`option but I can't find one with the Storage API. I've raised a ticket but hoping someone'll know.

I've been migrating our log ingest pipeline from writing (JSON) to BigQuery via `table.insert()` to using the BigQuery Storage API (which converts to a protobuf representation of the JSON you feed it).

The example code in the docs is god-awful, but I have persevered & made it work.

Storage API:
- Is >2x faster thus far
- Is ~10x cheaper (IIRC)
- Needs timestamps at microsecond resolution (sketch below), whereas `table.insert()` takes seconds 🤷🏼‍♂️
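
Since the timestamp one tripped me up the longest: the Storage API writer wants TIMESTAMP values as integer microseconds since the epoch, so I convert on the way in with a tiny helper along these lines (a sketch; `toEpochMicros` is my own name, not part of either API):

```js
// JS Dates only carry millisecond precision, so anything finer has to come from the
// source log line itself.
function toEpochMicros(value) {
  if (value instanceof Date) return value.getTime() * 1000;      // ms → µs
  if (typeof value === 'number') return Math.round(value * 1e6); // epoch seconds → µs
  return new Date(value).getTime() * 1000;                       // ISO string → µs
}
```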

Pleased with it despite the pain. I'll polish it next week!
#BigQuery #NodeJS #Data #BBC

During last night's England vs Denmark match, we hit just over 750k log lines/second (minutely mean; the true peak will've been higher) on my log pipeline. The peak minute saw 45.1M log lines.
Bearing in mind this only processes our web edge logs & a small subset of our media logs (plus supporting services), & the match wasn't super busy, that's quite a lot. I'll see if I can find out what the overall peak was. I'm guessing way into the millions.
#WebDev #Data #BigQuery #BBC #Euros

Weird... I had been working to migrate our log processing pipeline to use streams all the way from the log file itself to BigQuery. Got it all working and was very pleased, until I deployed it to the dev env and all writes failed due to lack of the `bigquery.tables.create` permission on the service account (despite the destination tables existing).
Added that and it works.
Seems `table.createWriteStream()` (node) uses an implicit `create if not exists` or similar.
#BigQuery #Node
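
If granting the permission isn't an option: my reading is that `createWriteStream()` runs a load job under the hood, and load jobs default to `createDisposition: CREATE_IF_NEEDED`, which is what wants `bigquery.tables.create`. A sketch of pinning it to `CREATE_NEVER` instead; I haven't verified this against every library version, and the dataset/table/file names are placeholders.

```js
const {BigQuery} = require('@google-cloud/bigquery');
const fs = require('node:fs');

// createWriteStream() metadata is the load-job config, so pinning createDisposition
// stops the implicit "create if not exists".
function loadNdjsonIntoExistingTable(localPath) {
  const table = new BigQuery().dataset('logs').table('cdn_access');
  const writeStream = table.createWriteStream({
    sourceFormat: 'NEWLINE_DELIMITED_JSON',
    createDisposition: 'CREATE_NEVER',  // fail fast instead of implicitly creating the table
    writeDisposition: 'WRITE_APPEND',
  });
  return new Promise((resolve, reject) => {
    fs.createReadStream(localPath)
      .pipe(writeStream)
      .on('error', reject)
      .on('complete', resolve); // emitted once the underlying load job finishes
  });
}
```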