
#reliability


✨ AI’s ‘reasoning’ is more mirror than mind—but that’s okay!
This ASU study reveals that LLMs’ "chain-of-thought" abilities are pattern-based illusions, not true logic.
The authors warn of a "false aura of reliability" in AI outputs, which could mislead in fields like healthcare or finance.
While that might sound disappointing, it’s actually useful insight! Understanding these limits can help us:
✅ Build better guardrails for AI in critical applications.
✅ Develop tests to expose AI’s blind spots.
✅ Shift focus from "human-like thinking" to reliable, transparent outputs.

Big thanks to @agnieszkaserafinowicz for sharing! 🙌

Read the full study:
Is Chain-of-Thought Reasoning of LLMs a Mirage?
arxiv.org/pdf/2508.01191

@ai@a.gup.pe @ai@misskey.io @openscience @artificial_intel @ai@newsmast.community @alphasignal.ai #AI #research #disinformation #LLM #languagemodel #thinking #Science #news #artificialIntelligence #technology #AIRisk #TechDebate #falseReliability #reliability

If they're gonna use "post-mortem" then I say use "postmortem". Make it a different word, refuse to align it with death.

Post-Incident Review is always better.

Learning Review isn't always the same thing for a lot of companies, sometimes you need both.

But whatever you call it, just do it.

→ Why I'm Betting Against AI Agents in 2025 (Despite Building Them)
utkarshkanwat.com/writing/bett

“[E]rror compounding makes autonomous multi-step workflows mathematically #impossible at #production scale. […] Production systems need 99.9%+ #reliability. Even if you magically achieve 99% per-step reliability (which no one has), you still only get 82% success over 20 steps. This isn't a prompt #engineering problem. This is #mathematical reality.”

Utkarsh Kanwat · Why I'm Betting Against AI Agents in 2025 (Despite Building Them)
I've built 12+ AI agent systems across development, DevOps, and data operations. Here's why the current hype around autonomous agents is mathematically impossible and what actually works in production.
#AI #Agents #prompt
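
The arithmetic in the quote can be checked with a few lines of Python. This is only a sketch: it assumes every step succeeds or fails independently with the same probability, and the function names `chain_success` and `required_per_step` are made up for illustration, not taken from the article.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` steps succeed.

    Assumption: steps fail independently with the same probability.
    """
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to reach `target` end-to-end success."""
    return target ** (1.0 / steps)


if __name__ == "__main__":
    # 99% per step over 20 steps, the case from the quote
    print(f"{chain_success(0.99, 20):.1%}")  # 81.8%, i.e. roughly 82%
    # Per-step reliability needed for 99.9% over the same 20 steps
    print(f"{required_per_step(0.999, 20):.4%}")
```

Under the same independence assumption, reaching 99.9% end to end over 20 steps would need roughly 99.995% reliability at each step.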

Staff SRE available for work!!!

I am a hardworking systems thinker with a unique balance of seasoned TechOps skills, good DevEx chops, and experience designing and running SRE programs like Observability, Incidents, and CI/CD.

I was put out of work in June and I need a new gig in short order. Boosts and cross-platform posts appreciated!

Useful paper investigating the precision of various #reliability and #MeasurementError parameters under different conditions and study designs:
link.springer.com/article/10.1

It comes with an #RStats shiny tool to explore some of these oneself:
iriseekhout.shinyapps.io/ICCpo

SpringerLink · Sample size recommendations for studies on reliability and measurement error: an online application based on simulation studies - Health Services and Outcomes Research Methodology
Simulation studies were performed to investigate for which conditions of sample size of patients (n) and number of repeated measurements (k) (e.g., raters) the optimal (i.e., balance between precise and efficient) estimations of intraclass correlation coefficients (ICCs) and standard error of measurements (SEMs) can be achieved. Subsequently, we developed an online application that shows the implications for decisions about sample sizes in reliability studies. We simulated scores for repeated measurements of patients, based on different conditions of n, k, the correlation between scores on repeated measurements (r), the variance between patients’ test scores (v), and the presence of systematic differences within k. The performance of the reliability parameters (based on one-way and two-way effects models) was determined by the calculation of bias, mean squared error (MSE), and coverage and width of the confidence intervals (CI). We showed that the gain in precision (i.e., largest change in MSE) of the ICC and SEM parameters diminishes at larger values of n or k. Next, we showed that the correlation and the presence of systematic differences have most influence on the MSE values, the coverage and the CI width. This influence differed between the models. As measurements can be expensive and burdensome for patients and professionals, we recommend to use an efficient design, in terms of the sample size and number of repeated measurements to come to precise ICC and SEM estimates. Utilizing the results, a user-friendly online application is developed to decide upon the optimal design, as ‘one size fits all’ doesn’t hold.
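
For readers who want a feel for the two parameters in the abstract, here is a rough Python sketch of a one-way ICC and SEM estimated from simulated repeated measurements. It is not the paper's code or the Shiny tool: it covers only the one-way model, and the function names, the normal distributions, and the values (mean 50, between-patient SD 10, error SD 5) are illustrative assumptions.

```python
import random
from statistics import mean


def one_way_icc_sem(scores):
    """One-way ICC and SEM for n patients with k repeated measurements each.

    `scores` is a list of n rows, each holding k measurements of one patient.
    """
    n, k = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    row_means = [mean(row) for row in scores]
    # Between-patient and within-patient mean squares (one-way ANOVA)
    msb = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msw = sum((x - m) ** 2
              for row, m in zip(scores, row_means)
              for x in row) / (n * (k - 1))
    icc = (msb - msw) / (msb + (k - 1) * msw)
    sem = msw ** 0.5  # SEM as the square root of the error variance
    return icc, sem


def simulate(n, k, sd_between=10.0, sd_error=5.0, seed=0):
    """Simulate scores as true score plus error (hypothetical settings)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        true_score = rng.gauss(50.0, sd_between)
        rows.append([true_score + rng.gauss(0.0, sd_error) for _ in range(k)])
    return rows


if __name__ == "__main__":
    # With these settings the population ICC is 100 / (100 + 25) = 0.8
    for n in (10, 50, 500):
        icc, sem = one_way_icc_sem(simulate(n, k=3))
        print(n, round(icc, 3), round(sem, 3))
```

Rerunning with different seeds and values of n and k gives an informal sense of how the estimates tighten as the design grows, which is the trade-off the paper studies systematically.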

I really liked this informal community poll and thematic analysis on SLO usage. It does a better job of highlighting the hurdles to adopting them at a Company Who Is Not Google than a lot of "Here's how to do SLOs" pieces that just don't cover it.

If there is ever a "Seeking SLOs" book, this should be the first chapter.

ericmustin.substack.com/p/note

A Small, Good Thing · Notes on Service Level Objectives, by Eric Mustin

Update. From @hildabast: "What if We Can’t Rely on PubMed?"
absolutelymaybe.plos.org/2025/

"#PubMed is incredibly reliable…That said, between the risks of an exodus of key personnel, understaffing, or goodness-knows-what vandalism when a goon squad arrives at NIH, it’s not paranoid any more to think ahead to the once-unthinkable. What would PubMed enshittification look like? Could PubMed go down more often, and for longer? Might services no longer be free? How else could the #quality and #reliability of its services be degraded?"

Absolutely Maybe · What if We Can't Rely on PubMed? - Absolutely Maybe
PubMed is incredibly reliable. And a lot depends on it. It’s an ecosystem built around MEDLINE, the steady feed of new publications…