DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:40

↗

If you've ever had to encrypt a nationalId, a creditCardNumber, or a medicalRecord field in a Spring Boot entity, you already know the drill. You write an AttributeConverter, you wire up a Cipher instance, you generate an IV, you figure out where the key lives, you get the...

If you've ever had to encrypt a nationalId, a creditCardNumber, or a medicalRecord field in a Spring Boot entity, you already know the drill. You write an AttributeConverter, you wire up a Cipher instance, you generate an IV, you figure out where the key lives, you get the GCM tag handling wrong once, you fix it, and three weeks later you finally trust it enough to ship.

We've done this enough times — across healthcare and fintech projects — that we stopped doing it manually. This post walks through the full implementation from scratch, the mistakes that are easy to make along the way, and then shows the one-annotation version we eventually packaged into Nucleus, our open-core Java framework.

Why GCM, and not just AES-CBC

If you search "AES encryption Java" you'll find a lot of CBC-mode examples. Don't use them for new code. CBC gives you confidentiality but no integrity check — an attacker can flip bits in the ciphertext and you won't know it happened until something downstream breaks in a weird way, or worse, doesn't break at all.

GCM (Galois/Counter Mode) gives you both confidentiality and authentication in one pass. It produces an authentication tag alongside the ciphertext, and decryption fails loudly if either the ciphertext or the tag has been tampered with. It's also the mode behind TLS 1.3, which is a reasonable signal that it's held up to scrutiny. The relevant specification is NIST SP 800-38D.

Building it by hand

Here's a minimal, correct implementation. This is the version you'd write before you have a framework to lean on.

public class AesGcmEncryptor {

    private static final String ALGORITHM = "AES/GCM/NoPadding";
    private static final int GCM_TAG_LENGTH_BITS = 128;
    private static final int GCM_IV_LENGTH_BYTES = 12;

    private final SecretKey key;

    public AesGcmEncryptor(SecretKey key) {
        this.key = key;
    }

    public String encrypt(String plaintext) {
        try {
            byte[] iv = new byte[GCM_IV_LENGTH_BYTES];
            SecureRandom.getInstanceStrong().nextBytes(iv);

            Cipher cipher = Cipher.getInstance(ALGORITHM);
            GCMParameterSpec spec = new GCMParameterSpec(GCM_TAG_LENGTH_BITS, iv);
            cipher.init(Cipher.ENCRYPT_MODE, key, spec);

            byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));

            // Prepend the IV so it travels with the ciphertext
            ByteBuffer buffer = ByteBuffer.allocate(iv.length + ciphertext.length);
            buffer.put(iv).put(ciphertext);

            return Base64.getEncoder().encodeToString(buffer.array());
        } catch (GeneralSecurityException e) {
            throw new EncryptionException("Failed to encrypt field", e);
        }
    }

    public String decrypt(String encoded) {
        try {
            byte[] decoded = Base64.getDecoder().decode(encoded);
            ByteBuffer buffer = ByteBuffer.wrap(decoded);

            byte[] iv = new byte[GCM_IV_LENGTH_BYTES];
            buffer.get(iv);
            byte[] ciphertext = new byte[buffer.remaining()];
            buffer.get(ciphertext);

            Cipher cipher = Cipher.getInstance(ALGORITHM);
            GCMParameterSpec spec = new GCMParameterSpec(GCM_TAG_LENGTH_BITS, iv);
            cipher.init(Cipher.DECRYPT_MODE, key, spec);

            byte[] plaintext = cipher.doFinal(ciphertext);
            return new String(plaintext, StandardCharsets.UTF_8);
        } catch (GeneralSecurityException e) {
            throw new EncryptionException("Failed to decrypt field — data may be tampered or corrupted", e);
        }
    }
}

A few details that matter and are easy to skip past:

The IV must be random and unique per encryption call, but it doesn't need to be secret. Prepending it to the ciphertext (as above) is the standard approach — you need it to decrypt, and it carries no information about the key.
12 bytes (96 bits) is the recommended IV length for GCM. Using a different length is technically possible but changes the internal computation in ways that reduce the security margin. Don't deviate without a specific reason.
Never reuse an IV with the same key. This is the one mistake that actually breaks GCM's security guarantees — IV reuse can leak the authentication key. If you're generating IVs randomly with SecureRandom per call, this is a non-issue at realistic volumes.
SecureRandom.getInstanceStrong(), not new Random(). This trips people up constantly. Random is not cryptographically secure and predictable IVs undermine the whole scheme.

Wiring it into JPA

The encryptor above is useless until it's actually applied to entity fields. The idiomatic way is a JPA AttributeConverter:

@Converter
public class EncryptedStringConverter implements AttributeConverter<String, String> {

    private final AesGcmEncryptor encryptor;

    public EncryptedStringConverter(AesGcmEncryptor encryptor) {
        this.encryptor = encryptor;
    }

    @Override
    public String convertToDatabaseColumn(String attribute) {
        return attribute == null ? null : encryptor.encrypt(attribute);
    }

    @Override
    public String convertToEntityAttribute(String dbData) {
        return dbData == null ? null : encryptor.decrypt(dbData);
    }
}

Then on your entity:

@Entity
public class Patient {

    @Id
    private UUID id;

    @Convert(converter = EncryptedStringConverter.class)
    private String nationalId;

    @Convert(converter = EncryptedStringConverter.class)
    private String diagnosis;

    // standard fields, no encryption needed
    private String firstName;
}

This works. It's also the point where most teams stop, because it's "good enough" — and it is, functionally. But there are a few gaps that tend to surface a few months later:

Key management is now manually wired through a constructor, which usually means it ends up in application.yml or, worse, hardcoded for "just this once" during a demo.
There's no audit trail. If a regulator or auditor asks "who decrypted this patient's data and when," a plain AttributeConverter gives you nothing.
Querying encrypted fields breaks. WHERE nationalId = ? no longer works because the database only sees ciphertext. Every team rediscovers this independently, usually in a sprint planning meeting that goes long.
You're now repeating this converter wiring on every sensitive field, in every entity, in every project.

None of these are hard problems individually. But solving all four, correctly, every time, across every project, is exactly the kind of thing that's worth doing once and reusing.

The one-annotation version

This is what we ended up building into Nucleus's encryption module:

@Entity
public class Patient extends BaseEntity {

    @SensitiveData
    private String nationalId;

    @SensitiveData
    private String diagnosis;

    private String firstName;
}

@SensitiveData triggers AES-256-GCM encryption at the field level automatically — same algorithm, same IV handling, same NIST SP 800-38D-aligned implementation described above, but the boilerplate is gone. A few things it handles that the hand-rolled version above doesn't:

Key management is pulled from a configured key provider rather than threaded through constructors. Key rotation is supported without a data migration.
Deterministic lookup hashes are generated alongside the encrypted value for fields you need to query on, so WHERE nationalId = ?-style lookups keep working without decrypting the whole table.
It plugs into the GDPR module, so encrypted fields are automatically eligible for the consent and retention policies tracked there — encryption and compliance aren't two separate systems you have to keep in sync.

The annotation isn't doing anything magical — it's the same AttributeConverter pattern under the hood, generated and wired automatically based on what we kept rebuilding by hand across projects.

When you don't need any of this

If you're encrypting one or two fields in a side project, the hand-rolled AesGcmEncryptor above is genuinely fine — there's no need to pull in a framework for that. Where this starts to matter is when you have GDPR or HIPAA obligations across a real schema, multiple entities with sensitive fields, and an actual audit requirement. That's the point where doing it five times by hand starts costing more than the framework would.

If you want to see the full module — including the key rotation flow and the GDPR integration — it's part of the open-core release:

Docs: clinvio.hu/nucleus/docs
Source: open-core on GitHub (link in the repo)

Happy to go deeper on the key rotation mechanics or the deterministic-hash lookup approach in the comments if anyone's curious — those are the two parts that generate the most questions when we walk through this internally.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:35

↗

DuckDB 1.4.5 LTS, pgEdge ColdFront Beta, and SQLite's FCNTL_PDB Internals Today's Highlights This week's highlights feature the latest DuckDB 1.4.5 LTS release, a new open-source beta for PostgreSQL data tiering, and a deep dive into an obscure SQLite internal file control...

DuckDB 1.4.5 LTS, pgEdge ColdFront Beta, and SQLite's FCNTL_PDB Internals

Today's Highlights

This week's highlights feature the latest DuckDB 1.4.5 LTS release, a new open-source beta for PostgreSQL data tiering, and a deep dive into an obscure SQLite internal file control operation. These updates offer performance, architectural flexibility, and internal insights across the SQLite ecosystem.

Announcing DuckDB 1.4.5 LTS (Andium) (DuckDB Blog)

Source: https://duckdb.org/2026/06/17/announcing-duckdb-145.html

The latest Long Term Support (LTS) release of DuckDB, version 1.4.5 named "Andium", has been announced, primarily focusing on bugfixes and performance enhancements. DuckDB, an in-process analytical processing database, continues to refine its engine for enhanced stability and efficiency in embedded and edge computing environments. While the announcement is concise, LTS releases are crucial for developers and organizations relying on a stable and well-tested version for their data pipelines and analytical workloads, ensuring long-term compatibility and reliability.

This update is vital for maintaining the robustness of applications that utilize DuckDB for local data transformations, complex analytical queries, and other high-performance data operations. Users of previous 1.4.x versions are encouraged to upgrade to benefit from the accumulated stability improvements and minor speedups, all without introducing major breaking changes. This commitment to incremental improvements and stable releases solidifies DuckDB's position as a premier solution for embedded analytical database needs, making it a reliable choice for critical projects.

Comment: An LTS release, even with bugfixes, is always welcome from DuckDB. It reinforces their commitment to a stable and performant analytical database that I frequently use for local data processing and reporting.

Introducing ColdFront: Seamlessly Uniting OLTP, Analytics and AI Workloads on PostgreSQL (Planet PostgreSQL)

Source: https://postgr.es/p/9mf

pgEdge has announced the beta release of ColdFront v1.0.0-beta1, an innovative open-source solution designed to provide transparent data tiering for PostgreSQL. ColdFront's primary goal is to seamlessly integrate OLTP, analytical, and AI workloads directly within a single PostgreSQL instance, eliminating the need for application code changes. It achieves this by intelligently identifying and moving data between various storage tiers based on access patterns and data age, thereby optimizing both cost efficiency and query performance.

This tool is particularly significant for applications that leverage PostgreSQL with extensions like pgvector for AI functionalities. ColdFront enables efficient management of vector embeddings and other data types across diverse storage infrastructures—from high-performance SSDs to more economical object storage—all while presenting a unified data view to the application. The ability to effectively tier data for both frequently accessed transactional data and less-frequently accessed analytical or AI datasets offers substantial advantages for scaling PostgreSQL deployments, especially as data volumes expand and AI integrations become more complex. The beta release invites developers to experiment with its capabilities for simplifying advanced data management challenges in real-world production settings.

Comment: A transparent data tiering solution for PostgreSQL that explicitly supports pgvector and AI workloads is a significant step forward. The promise of 'no application code changes' makes ColdFront a highly practical tool for managing large, mixed-workload databases.

Why does SQLITE_FCNTL_PDB exist? (SQLite Forum)

Source: https://sqlite.org/forum/info/cbad8f0a383d8fa29f43c18642002aa8f67abfdf6479e41dbbe2295becdfb9fb

A recent post on the SQLite forum initiates a discussion regarding the purpose and existence of the SQLITE_FCNTL_PDB file control operation. This query delves deep into the internal mechanisms of SQLite, specifically how it handles Program Database (PDB) information, which is primarily utilized for debugging symbols on Windows platforms. Investigating such low-level fcntl (file control) operations provides critical insights into SQLite's robust cross-platform compatibility, its intricate build processes, and its nuanced interactions with underlying operating system features.

The presence of SQLITE_FCNTL_PDB suggests specific optimizations or integrations that SQLite performs, likely for debugging purposes or in response to particular compiler directives on Windows systems. For developers who frequently work with SQLite's source code, are involved in porting the database to new environments, or require advanced debugging capabilities, understanding these less-common fcntl flags is essential. Such discussions illuminate the meticulous engineering and attention to detail embedded within SQLite, enabling it to operate reliably and efficiently across a diverse range of computing environments while maintaining its renowned performance and stability. These internal explorations deepen one's appreciation for SQLite's sophisticated architectural design.

Comment: Delving into a specific SQLITE_FCNTL_PDB flag might seem niche, but it's exactly the kind of SQLite internals discussion that reveals how robust and adaptable the engine is, especially for platform-specific interactions and debugging.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:34

↗

Introduction While all code should be efficient, code for library-like components, especially involving loops, should be as efficient as possible since such code is often widely used. In my A Simple Dynamic Array for C, I included the source code for a function to clean-up a...

Introduction

While all code should be efficient, code for library-like components, especially involving loops, should be as efficient as possible since such code is often widely used.

In my A Simple Dynamic Array for C, I included the source code for a function to clean-up a dynamic array:

void array_cleanup( array_t *array, array_free_fn_t free_fn ) {
  if ( array == NULL )
    return;
  if ( free_fn != NULL ) {
    char *element = array->elements;
    for ( size_t i = 0; i < array->len; ++i ) {
      (*free_fn)( element );
      element += array->esize;
    }
  }
  free( array->elements );
  array_init( array, array->esize );
}

While this code is correct and good enough for pedagogical purposes, it’s not optimal. Before I tell you why, see if you can figure out why. Hint: it has to do with the use of both the array pointer and the function call in the loop.

For those who might not get the reference in this article’s title, it’s a play on a line from The Wizard of Oz. The original line was “Lions and tigers and bears! Oh, my!”

Loop Refresher

In C (and languages inspired by C including C++, C#, Go, and Java), for loop conditions are reevaluated on every loop iteration. For example, in:

for ( int i = 0; i < n; ++i )

the condition i < n is evaluated on every iteration. That means if n is changed within the loop, it could terminate either earlier or later than it was initially supposed to — and this is well-defined behavior.

This is in contrast to some older languages like Fortran and Pascal as well as some newer languages like Rust where the loop’s termination value is evaluated once just prior to the start of the loop. For example, in Pascal:

for i := 1 to n do

is actually treated as if it were this:

limit := n;
for i := 1 to limit do

Note that if the loop condition is a more complicated expression, such as calling a function:

for ( int i = 0; i < f(x); ++i )

then the function will be called on every loop iteration.

Except if the function is marked as unsequenced in C23 in which case the compiler can hoist its call out of the loop.

(Have you figured out the “why” yet?)

Pointers

Wherever pointers are involved, things get more complicated. From an optimization perspective, pointers are like sand in your gears. Returning to the loop in array_cleanup:

for ( size_t i = 0; i < array->len; ++i ) {
  (*free_fn)( element );
  element += array->esize;
}

That gets compiled by clang into this x86-64 assembly:

.LBB0_4:
    mov  rdi, r15
    call r14                      ; (*free_fn)( element )
    add  r15, qword ptr [rbx + 8] ; element += array->esize
    inc  r12                      ; ++i
    cmp  r12, qword ptr [rbx + 16]; i < array->len
    jb   .LBB0_4

The things to notice are the qword ptr lines which means the code has to dereference array (which means read from memory) twice on every iteration. That’s slow.

The problem is that the compiler can’t know for sure that the array object pointed to by array isn’t modified by the function pointed to by free_fn. For all the compiler knows, the function has access to a global copy of the pointer and could modify the array. Compilers have to play it safe.

If there were no function call in the loop, then the compiler could safely hoist both len and esize outside the loop as if the code were:

size_t const esize = array->esize;
size_t const len = array->len;
for ( size_t i = 0; i < len; ++i ) {
  // ... do something other than call a function ...
  element += esize;
}

At best, the compiler can put esize and len into registers; but even at worst, if the compiler puts them into local variables, it would still save two pointer dereferences per iteration.

`restrict` Revisited

Is there any way to tell the compiler that free_fn won’t modify the array? Why yes, there is: restrict:

void array_cleanup( array_t *restrict array,
                    array_free_fn_t free_fn ) {
  // ... otherwise the same as before ...
}

(If you are unfamiliar with restrict, you should read my previous article about it first.)

Even if you are familiar with restrict, you may be surprised to learn that it can help in this case. The canonical use case for restrict is memcpy:

void* memcpy( void *restrict dst, void const *restrict src, size_t n );

where there are two pointers marked restrict which means that they do not point to overlapping regions of memory, i.e., they don’t overlap with each other.

But in the case of array, how can restrict help when there is no other pointer in array_cleanup for array not to overlap with?

It turns out that restrict is more general than you might think. What the restrict in the modified array_cleanup means is that, during its execution, the dynamic array pointed to be array will not be modified by any pointer anywhere in the program, even if free_fn has a copy of such a pointer. Using restrict is a promise you make to the compiler. (If you break that promise, even unintentionally, the result is undefined behavior.)

With restrict, the assembly becomes:

.LBB0_4:
    mov  rdi, r12
    call r14                      ; (*free_fn)( element )
    add  r12, qword ptr [rbx + 8] ; element += array->esize
    dec  r13                      ; --i
    jne  .LBB0_4                  ; i > 0

The compiler has eliminated array->len dereference and is now counting down to zero.

But why didn’t the compiler optimize the remaining dereference (qword ptr) for esize?

First, similar to inline, restrict can be ignored by the compiler. Second, every CPU has only finite resources, most notably registers. In this case, the compiler presumably thought that the overall performance would still be better even if esize were not put into a register. If it were to be put into a register, then the compiler would have to ensure that it’s value is the same after free_fn is called as before — and presumably that was more costly than a dereference.

As I noted in my restrict article, restrict isn’t yet part of standard C++; however, many compilers support __restrict__ as an extension.

Explicit Caching

Of course you can forget about restrict and just use local variables to cache the values as was done initially:

size_t const esize = array->esize;
size_t const len = array->len;
for ( size_t i = 0; i < len; ++i ) {
  (*free_fn)( element );
  element += esize;
}

Then, despite the function call in the loop, the assembly becomes:

.LBB0_4:
    mov  rdi, rbx
    call r14
    add  rbx, r13
    dec  r15
    jne  .LBB0_4

with no dereferences.

Conclusion

If you want to optimize code using pointers, functions, and loops, consider using either restrict or explicit caching. But you should always check the generated assembly to ensure your tuning is having the desired effect.

Epilogue

If you’ve never heard of the Compiler Explorer, aka, godbolt.org, it’s an extremly useful in-browser tool for compiling C or C++ code with virtually any compiler you’ve ever heard of, at any optimization level, and showing you a mapping from source lines to their corresponding assembly lines for most any CPU architecture. (Why is it called godbolt? Because its creator is Matt Godbolt.)

DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:25

↗

My agents were confidently wrong about the world, and I couldn't tell when. That's the part that got to me — not the wrongness, the confidence. I run my one-person company as a fleet of about twenty AI agents — a content writer, a finance one, a researcher, a security...

My agents were confidently wrong about the world, and I couldn't tell when. That's the part that got to me — not the wrongness, the confidence.

I run my one-person company as a fleet of about twenty AI agents — a content writer, a finance one, a researcher, a security officer, a handful more. They're good at the work I built them for. But every one of them shares a flaw I'd been papering over: when a task needs a fact about the world — how a tax threshold works, what a marketing framework actually says, how a platform bills — the model reaches into its training data and answers in the exact same self-assured tone whether it knows or is improvising. There is no tell. The guess and the fact wear the same face.

So this month I built the thing that was missing: a cited, fact-checked knowledgebase the agents have to read before they work, with a gate that keeps me from poisoning my own source of truth. Here's how it's built, the one rule that turned out to matter most, and the honest state of it — which is that I finished it days ago and have no idea yet whether it changes the work.

The job I was actually hiring this to do

Strip away my setup and the problem is one any solo operator using AI already has. You ask the model for something that depends on a real fact. It answers fluently. You either know enough to catch the error or you don't — and the whole reason you're asking is usually that you don't. The job I needed done wasn't "make my agents smarter." It was narrower and more honest: stop my AI from making things up in the one register where I can't catch it, and let me know which claims I can actually trust.

The competition for that job, in my shop, was "just let the model wing it and hope." That had already cost me. A marketing analysis once understated a channel's numbers because an agent trusted a stale figure instead of pulling the live one. Small, recoverable — but it's the recoverable ones you see. The ones you don't see are the ones that scare you.

What I built

It's a region of my notes vault I call the 10-Library — the factual half, deliberately walled off from the operational logbook (sessions, decisions, day-to-day state). One test decides what's allowed in: "Would this still be true if my company vanished tomorrow?" A fact about a copywriting framework: yes, it lives here. A fact about my own revenue: no, that's operational memory. The Library only holds world-true things.

The concrete shape, as of today:

110 notes, each one atomic — a single concept, not a topic dump — across five categories so far: infrastructure, Linux, networking, distribution (marketing), and freelancing.
Claim-level sourcing. Every factual sentence carries an inline citation to where it came from. No source, no entry. A note without a footnote isn't a Library fact; it's an opinion, and it doesn't get to sit in the source of truth.
A quarantine. New facts — researched from the web on a schedule, or pulled on demand when an agent hits a gap — don't land in the trusted set. They land in a _quarantine folder marked auto-unverified, and they stay invisible to the agents until I read them and promote them by hand. Right now there's exactly one note sitting in quarantine, waiting on me.

The notes are distilled from sources I'd actually defend — accredited course material and primary references, not a model free-associating about itself. The marketing notes I leaned on to write this very essay, for instance, cite Ogilvy's own books and a 1994 psychology paper, not a listicle.

The one rule that mattered most

If I keep only one sentence from this build, it's this: the agent reads the knowledgebase, but the agent never gets to silently rewrite it.

This sounds like bureaucratic caution. It's the opposite — it's the thing that keeps the whole structure from rotting. There's a documented failure mode where a language model asked to re-verify its own knowledge base will quietly degrade correct facts and ratify its own errors, all while looking like diligent maintenance. An AI grading its own homework drifts, and drifts confidently. So in my Library, automated re-checking is advisory and append-only: it can flag a note as stale or contradicted, it can stage a proposed change for me to look at — but it cannot overwrite a single fact on its own. The human promote step is not a nicety. It's the load-bearing wall.

The second half of the same rule: ingested web content is treated as data, never instructions. A note pulled from the internet can inform an answer; it can never trigger an action — no write, no tool call, no spend — without me confirming first. That single boundary quietly closes off the nightmare where someone poisons a web page my researcher reads and my fleet starts acting on the poison. The agents read the world. They don't take orders from it.

What this is not

It is not RAG-makes-the-AI-correct. A knowledgebase doesn't make a model truthful; it gives me a bounded, citable place where I've decided what's true, so that for the facts that matter I'm relying on something I vetted instead of something the model felt confident about. The gap between those two is the entire point. Outside the Library, my agents are exactly as fallible as yours. Inside it, at least, when one cites a fact I can click the footnote.

And it is brand new. I'm not going to tell you it made the fleet measurably better, because I genuinely don't know yet — it's been live for days, not months. What I can tell you is the failure it's designed to stop is real, I've been bitten by the cheap version of it, and the design is the one I'd defend: read before you work, cite every claim, and never let the machine quietly edit the truth.

The honest status, as always

This is the part of every one of these I refuse to skip, because the honesty is the actual product. Revenue is still zero. The audience is still tiny. The knowledgebase I just described was built this week — 110 notes is a respectable start and also nowhere near a finished reference; whole categories I sketched are still empty. The most alive thing in the whole operation remains the operating system itself plus one warm commission: a small desktop app I'm making for my mother, who'll pass it by word of mouth to the people she works with.

So treat this as a build log, not a victory lap. I built a cited knowledgebase so my AI workers would stop guessing in the register where I can't catch them, and I wired in the one rule — read it, don't rewrite it — that keeps me from becoming the thing that corrupts it. Whether it earns its keep, I'll find out in public and tell you either way.

If you use AI for anything load-bearing, here's the cheap version you can steal today: keep one file of facts you've personally checked, cite where each came from, point the model at it before it answers, and never let it edit that file without you reading the change. You don't need a vault or twenty agents. You need a place where "true" means you decided it, not the model.

I write up the operating system one piece at a time — the agent wiring, the failures, the rules I reverse — in Unbearable TechTips Weekly. It's practical homelab and agent ops, the real status included. If the mess is useful to you, that's where the specifics live.

— Noel @ Unbearable Labs

DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:23

↗

Recently, while working on an in-progress open-source framework called Projector, I ran into a (not particularly novel) issue: one of it's internal packages (core) had grown during this period, and was not nearly as flyweight as it needed to be in the browser. The result was...

Recently, while working on an in-progress open-source framework called Projector, I ran into a (not particularly novel) issue: one of it's internal packages (core) had grown during this period, and was not nearly as flyweight as it needed to be in the browser. The result was 10-20kbs of unnecessary machinery getting pulled in.

I noticed this while running examples. I was consistently hitting a wall in bundle sizes that was surprisingly difficult to get past, even for someone as stubborn and relentless as I am. Naturally, I turned to Claude and ChatGPT to help me with this, and ended up using ChatGPT 5.5 with Codex as I find that, with the "precise" output mode, it tends to be a little more honest than Opus 4.8 these days.

I shared exported HAR network logs with it, having it go through the chunks to confirm where the bulk was; consistently, it confirmed that the issue was around an entangling of authoring/resolution code with runtime code in core that was pulling in too much to the browser.

The technical details here aren't really important, but I'm using them to illustrate a larger point.

We then iterated through a lot of different solutions—I setup a "goal" in codex with benchmarks to hit, and gave it a bunch of constraints, context, and tooling. Finally, after about 2-3 hours of looping against that goal, it completed.

Looking through the git diff, I noticed something odd—it had duplicated the result of the resolved module, so it could skip the resolution machinery and thus drop it from the browser bundle (again, technical details not really relevant).

It hit the rough kb benchmarks, respected all constraints, utilized all context and skills available, and avoided importing the machinery that we both aligned on being the core problem. It provided an elegant, coherent, well-written api, implemented a surgical, well-tested, well-designed solution, and convincingly defended its work when I queried about the implementation.

That sounds great, right? In fact, I think it highlights an emerging issue with frontier models—the ability to present sophisticated-looking, elegantly executed, not-wrong solutions with increasingly convincing reasoning, while simultaneously not being able to navigate ambiguity outside of a prompt's implicit parameters (yes, the problem is always user input and/or expectations, ie me; the LLM is just tooling).

May You Get What You Asked For

This, to me, is an increasing footgun in the advancement of LLMs, especially as the world seems to increasingly fixate on the abilities of LLMs to deliver "working" solutions and outcomes. It highlights how, as LLMs generate increasingly sophisticated output, it requires increasingly sophisticated design review and precise prompting to avoid potentially catastrophic land-mines hidden behind a very inhuman approach to problem solving. This is the kind of code design that make code bases increasingly harder to reason about for both humans and LLMs; a good sign of it is ever-increasing LOC and indirection. For my use case, it was a perfect example of an ironic saying I've heard somewhere that goes something like "may you get what you asked for", which I think perfectly sums up our relationships with LLMs.

When I looked at the code, it wasn't immediately clear what I disliked; all that I had was an intuition that something wasn't quite...right.

It gave me a convincing argument that, yes, what it added created some duplication and indirection, but that it was arguably acceptable and pragmatic here because otherwise we'd need to expose resolution internals from core that shouldn't necessarily be public. In fact, all of the arguments were perfectly sound and convincing.

I think a lot of engineers would have stopped there, because I almost did. But still, something irked me, and I'm annoying. It still added complexity to a system where this problem was already solved somewhere else internally. The case it presented boiled down to, why go through all that resolution machinery when we already know what the outcome should be in this particular scenario?

The answer is: that makes perfect sense...if you're an LLM looking for a solution amidst ambiguity. As a human, I saw that it added both complexity and potential for drift that outweighed the design, and hinted at a larger underlying problem in the codebase. It created a subtle caveat to a system designed to be reasonable and predictable. It made it harder to grep for a source-of-truth, or understand why everything else is modeled one way, except for this place, and added something that was similar-but-different in how we treated code here.

So I tried reframing the problem: if, in an ideal world, core was extremely cheap to import and run, would the ideal state be to use materialization on the browser? Or is the larger point that the resolution is not needed in the browser at all, because we already know what we need?

Its response was definitively different than its implementation:

"In an ideal world where core is extremely cheap to import and run, I would use materialization in the browser. Not because the browser needs ceremony, but because resolution is the mechanism that preserves the model as the source of truth. If that mechanism were cheap, there would be little reason to create a separate base fast path. The browser would simply materialize the surface service from the model-bound plan, and we would be done."

Again, ignore the technical details, but note that this is not what it implemented. To me, after all the work and iteration that we did, this was the most useful kernel of output—a clear articulation of the problem as a design issue in core that would require a more invasive refactor. We did that, and the kbs dropped below the target, the separation of concerns in core made a lot more sense, and overall the entire system became easier to reason about.

They're Not Human 😱

Again, this isn't about the LLM output or problem itself; it's not a particularly novel scenario. It's about how:

we guide LLMs through ambiguity
LLMs make a million decisions and choices all the time that we as humans wouldn't—even if they're trained on our data and tightly coached by us

As these models advance, they're providing increasingly convincing, well-designed, elegant, nuanced solutions, with increasingly convincing reasoning and rationale, while simultaneously potentially compromising the stability, scalability, and maintainability of a code base. The issue here is again, not the LLM, it's that it has no real point of view on how to resolve ambiguity; so when I lacked precision in my description of the problem, regardless of the guardrails I provided and benchmarks I told it to hit, it delivered a solution that would have potentially seriously compromised the scalability and integrity of my code base.

But note in the example I provided, it took me multiple steps to even be able to articulate the underlying issue:

I provided a detailed prompt, with constraints, context, a specialized harness, and a goal with benchmarks to hit
I checked in over multiple hours of iteration by the LLM to see where it was going
I performed a code design review
I went back-and-forth, debating over its implementation
I pursued further contextual investigation myself of the codebase based on my intuition and deep understanding of my principals, original goals, and lots of subconscious workings I'm probably not even privy to,
I reframed the topic to the LLM
We aligned on an actual root cause around an underlying design tension

Notably, this took significant time, effort, iteration, and discipline, along with a human intuition about what was subjectively "right" and "wrong" to arrive at an objectively better solution.

What is most interesting to me is how much better these models are getting in presenting convincing, reasonable arguments, and writing what is otherwise well-designed, ergonomic, well-tested code...that still might paper over a nuanced, underlying system design problem with real potential consequences. The latter half of that statement, the papering over, is nothing new to software engineers who regularly work with LLMs. What is increasingly new is the sophistication of the papering over as models become more capable, combined with the fact that we're asking it to do harder things in more ambiguous problem spaces.

This provides a growing opportunity for oversight gaps in detecting mismatches where the LLM designs solutions to problems that don't exist, solutions that solve the wrong problems, or solutions for a problem that you couldn't quite articulate (which, if you can't articulate a problem, you can't know if you've solved it). Not to mention the amount of time it takes to uncover these issues, where it would be far more stark if it was a less capable model.

They're...Really Not Human 😱

LLMs are inherently solution-oriented, while human engineers are curious, chaotic, and problem-oriented. We may say we're solution oriented, but the most rewarding parts of my career have been trying to solve incredibly hard problems through collaboration, iteration, and trial-and-error with team members, who contribute perspective, experience, and points-of-view that I may crave from an LLM, but will never get, because they are not human.

The danger I see is that the better and more convincing these models get, the more we may trust them to handle important tasks, like we would with more senior engineers. But it's not looking like they "scale" in the same way human engineers do...at all.

In human engineers, with increased skill and seniority also comes increasingly sophisticated solutions and problem-solving, sure. But it also comes with stronger points of view, and an increased ability to make sound judgment calls based on a number of intangible human factors to navigate ambiguity in problem spaces, like prioritization, accountability, responsibility, experience, judgement, and trust.

The differences are becoming more and more stark in that there's a fundamentally different evolution when it comes to evaluating the "seniority" of LLMs, and it's increasingly bolstering my gut instinct that when people say things like "this LLM can replace an x-level engineer, with oversight" just...doesn't make sense; it really boils down to trust.

My trust that an LLM understood how to solve a problem correlates directly with:

the level of precision in which I'm able to articulate the problem
the level of ambiguity inherent to the problem space

The more precision and the less ambiguity, the more I will trust the LLM to do a good job. I say "trust" and not "confidence", because I never trust that an LLM to ask me when it doesn't understand something.

What Are Companies Doing?

When I see companies like Meta doing mass layoffs of engineers that they think can be replaced with AI, my first thought is empathy as an engineer myself, but immediately followed by confusion. The level of hubris in thinking that LLMs can come close to replacing the utility of a human engineer's brain, process, and intuition to navigate ambiguity is astounding to me, as someone who works with LLMs 24/7.

Also, simply put: you cannot have senior engineers create more senior engineers without junior engineers. It's just the law of nature, folks!

Additionally, to not see that increasingly sophisticated and complex solutions from incredibly powerful yet novel tooling requires an equal amount of careful oversight from an increased amount of people spending more time on doing that demonstrates leadership that's shockingly out of touch with reality, people, and impossibly, even the technology they create. It speaks to a culture of using engineers to do mechanical work that can be replaced by automation, which is simultaneously a waste of company money, as well as a waste of engineering careers.

It also offloads an incredible burden to justify that level of investment in AI tooling to the engineers, while simultaneously prescribing it as a tool, because LLMs are a tool. While AI is incredibly useful and does increase productivity in specific scenarios, to use it right requires discipline, oversight, time, points of view, and context that (at the moment) only a human can provide. Not to mention, the value they're actually providing is incredibly subjective and use-case specific, often not even related to the code itself, but the iteration process with code.

Time Spent

For better or for worse, as models advances, I have found that, due to the increased sophistication of the code it writes and the design choices it makes, time commitment to code review is only growing, because the trade-offs of accepting AI-generated code become increasingly nuanced (not obsolete) while the risks inherent to developing and scaling software remain. Ironically, in the end, we inevitably end up being constrained by human factors anyway, because we as engineers can only carefully review so much before our brain collapses and we either approve a PR or stare at the wall for an hour thinking about nothing.

Would I rather review a sophisticated PR written by a human or an agent? Honestly, it's all code to me, but I trust a human far more than an agent, and if I ask why, a human will provide me with an honest response and an actual point of view.

Again, I love LLMs! This article began as a simple exploration into my thoughts on the evolving sophistication of them and the pit-falls that come with that. At the end of the day, I find the rapid advancement of LLM models fascinating, valuable, time-saving, time-consuming, and incredibly vexing; I find LLMs provide value in codegen, but also in quick iteration to help me discover that I might be solving the wrong problem much faster than if I had to do it all myself, which is value in time saved, but in a different way I might have expected. Unsurprisingly, as models become more capable, I ask more ambiguous and demanding tasks from them, and this is also the place where I find the most interesting shortcomings.

With increased abilities, I find myself spending more time reviewing LLMs with less trust, rather than spending less time with more trust, because I've learned the hard way that just because the argument and implementation it presents appears convincing, and the output is well-designed, articulated, and executed, does not mean it's correct. There's no shortcut to resolving ambiguity, in the end.

And for any organizations that think sophisticated code output outweighs the value of engineers across levels collaborating on ambiguous problem spaces...well, may you get what you asked for.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:15

↗

Every connected device on your desk, from a smart plug to a fitness band to a hobbyist ESP32 board, runs on a descendant of one tiny chip that was never meant to change the world. In 1971, Intel released the 4004, the first commercially available microprocessor. It was not...

Every connected device on your desk, from a smart plug to a fitness band to a hobbyist ESP32 board, runs on a descendant of one tiny chip that was never meant to change the world. In 1971, Intel released the 4004, the first commercially available microprocessor. It was not built for computers, robots, or the internet. It was built to run a desk calculator. The story of how a calculator chip became the foundation of modern IoT is one of the most instructive in all of electronics.

A calculator contract that got out of hand

The 4004 began as a job for hire. A Japanese calculator company called Busicom approached Intel in 1969 wanting a set of custom chips for a new line of printing calculators. The original plan called for around a dozen separate, purpose-built integrated circuits, each wired to do one fixed task. It was the standard approach of the era: if you wanted a device to do something, you designed silicon that did exactly that and nothing else.

Intel engineer Ted Hoff looked at the sprawling design and proposed something radical. Instead of a pile of single-purpose chips, why not build one general-purpose processor that could be told what to do through software? A program stored in memory could make the same chip behave like a calculator today and something else entirely tomorrow. Stanley Mazor helped shape the architecture, and a newly arrived engineer named Federico Faggin turned the concept into a working device, inventing the silicon-gate design techniques that made it physically possible. Masatoshi Shima, Busicom's representative, worked alongside them on the logic.

2,300 transistors that started everything

When the 4004 was announced on November 15, 1971, it packed about 2,300 transistors onto a single sliver of silicon. By modern standards that is almost nothing; a current smartphone chip holds tens of billions. But the leap was not about raw count. It was about the idea. For the first time, a complete central processing unit existed on one chip that anyone could buy and program for their own purposes.

That was the breakthrough that mattered. A general-purpose, programmable processor meant the cost and effort of designing custom silicon no longer had to be repeated for every new product. You could buy the brain off the shelf and define its behavior in software. That single shift is the reason embedded computing exists at all.

Why this matters for IoT today

Trace the lineage forward and the path runs straight to the devices Fluidwire builds. The microcontroller at the heart of a modern IoT sensor, whether it is an ESP32 reading temperature in a warehouse or a low-power chip counting steps on a wrist, is a direct descendant of the 4004's core idea: a programmable processor cheap and small enough to embed inside an ordinary product.

The economics that the 4004 unlocked are exactly what make connected devices viable. Because a capable processor now costs a few dollars or less, it makes sense to put intelligence into a light switch, a water meter, or a soil-moisture probe. The same logic that let Busicom replace a dozen fixed chips with one programmable one is what lets a startup ship a smart product without designing custom silicon from scratch. You write firmware instead.

The lesson for builders in the Philippines

For engineers and students here in the Philippines, the 4004 carries a useful message. The chip that launched a trillion-dollar industry was not the product of a grand plan; it came from solving a specific, unglamorous problem (a calculator) in a more general way than the brief required. That instinct, to build a flexible foundation rather than a one-off, is the heart of good embedded and IoT design.

It is also a reminder that hardware and software are partners. The 4004 was useless without a program, and its real power was that the same silicon could do countless jobs depending on the code it ran. Every thesis prototype and connected-product build we help teams ship works the same way: capable, affordable hardware made specific through firmware.

If you are designing a connected device and want a partner who understands both the silicon and the cloud it talks to, see how Fluidwire approaches IoT and embedded development or get in touch with our team. The chip that started it all was built for a calculator. What you build with its descendants is entirely up to you.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:15

↗

AI Observability for Lovable Apps: Monitor Prompts, Traces, and Evaluations with Currai Building AI applications has never been easier. Tools like Lovable allow developers and founders to create AI-powered products in minutes. Whether you're building a chatbot, AI assistant,...

AI Observability for Lovable Apps: Monitor Prompts, Traces, and Evaluations with Currai

Building AI applications has never been easier.

Tools like Lovable allow developers and founders to create AI-powered products in minutes. Whether you're building a chatbot, AI assistant, recommendation engine, AI agent, or prediction app, generating the application is often the easy part.

The real challenge starts after launch.

How do you know what prompts are being sent to the model?
How do you debug unexpected AI responses?
How do you compare prompt variations and determine which performs better?
How do you evaluate output quality over time?
How do you track token usage and costs?

This is exactly why we built Currai.

What is Currai?

Currai is an AI observability platform that helps teams understand, test, and improve AI applications in production.

It provides:

Prompt tracing
AI request monitoring
Session tracking
Prompt versioning
A/B testing
LLM evaluations
Cost and token analytics
OpenTelemetry support

Instead of guessing why your AI application produced a particular response, Currai lets you inspect the entire execution flow.

The Problem With AI Applications

Traditional monitoring tools were built for APIs, databases, and backend services.

AI applications introduce a completely different set of challenges:

Prompt changes can significantly impact output quality
Model updates can affect behavior
Hallucinations are difficult to track
User conversations are hard to debug
Prompt experiments are often unmanaged
Quality evaluation is usually manual

When something goes wrong, application logs alone don't provide enough visibility.

You need observability designed specifically for AI systems.

Trace Every AI Request

Currai captures every prompt, model response, latency metric, token usage, and cost.

You can inspect:

System prompts
User prompts
Model outputs
Execution traces
Tool calls
Metadata

This makes debugging AI applications dramatically easier.

Run Prompt A/B Tests

Prompt engineering remains one of the most effective ways to improve AI quality.

With Currai, you can compare multiple prompt variants and determine which performs best.

Instead of relying on intuition, you can make decisions using real data.

Whether you're testing:

Different system prompts
Different model providers
Different retrieval strategies
Different output formats

Currai helps you measure the impact.

Evaluate Prompt Quality

Currai includes evaluation workflows that help measure output quality automatically.

You can define evaluation criteria and continuously monitor performance as prompts evolve.

This is especially useful when shipping AI features to production and ensuring quality remains consistent over time.

Understand Usage and Costs

AI costs can grow quickly.

Currai helps you monitor:

Token consumption
Request volume
Latency
Errors
Cost trends

Everything is tied back to the actual traces that generated those metrics.

Example: Building a World Cup 2026 Prediction App with Lovable

To demonstrate how Currai works, I built a FIFA World Cup 2026 prediction application using Lovable.

The app allows users to select two national teams and generate an AI-powered match prediction.

While the application is running, Currai captures:

Every LLM request
Prompt inputs
Model responses
Prompt experiments
Evaluation results
Trace metadata

This makes it easy to understand how the AI behaves and improve prediction quality over time.

Why AI Observability Matters

As AI applications become production systems, observability becomes a necessity rather than a luxury.

Without visibility, you're effectively debugging blind.

Whether you're building:

AI Agents
Chatbots
Copilots
RAG Applications
Customer Support Assistants
Internal AI Tools

Understanding how your AI behaves is critical.

Currai was built to provide that visibility.

Getting Started with Currai

Getting started takes only a few minutes.

Create an account at https://www.currai.app
Generate your API keys
Install the Currai SDK
Instrument your AI application
Start viewing traces, experiments, and evaluations

You can begin monitoring your AI workflows immediately.

Demo Video

In the video below, I show how to build a World Cup 2026 prediction app with Lovable and use Currai to:

Trace every AI request
Compare prompt variations with A/B testing
Evaluate response quality
Debug model outputs
Monitor costs and performance

Learn More

Website: https://www.currai.app
Documentation: https://www.currai.app/docs

If you're building AI products and want better visibility into prompts, traces, evaluations, and experiments, give Currai a try.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:14

↗

Most AI coding tools assume you're sitting in front of a repo. There's a working directory, some source files, tests, maybe a CI pipeline. The AI reads your code, suggests changes, ru****ns tests. Great model — if you're a developer writing code on a Tuesday afternoon. Now...

Most AI coding tools assume you're sitting in front of a repo.
There's a working directory, some source files, tests, maybe a CI pipeline. The AI reads your code, suggests changes, ru****ns tests. Great model — if you're a developer writing code on a Tuesday afternoon.
Now picture the other scenario.
It's 2am. PagerDuty fires. You SSH into a box you haven't touched in three months. Something is broken, you're not sure what, and the runbook was last updated by someone who left the company in 2022.
You're not thinking about repos. You're thinking: what OS is this thing running? What just failed? Is it safe to restart that service or will I make it worse?
These are two fundamentally different workflows. But almost every AI terminal tool I've seen is built for the first one.

The 30 minutes nobody builds for
There's a ton of tooling for the world before you log into the box: Prometheus, Grafana, PagerDuty, incident.io, runbooks, dashboards. All useful. No complaints.
But there's this practical 30-minute window after you SSH in where you're basically doing archaeology with journalctl and grep. You check systemd. You look at disk. You read logs that were clearly written by someone who hated future-you. You copy-paste terminal output into Slack so your teammate can squint at it from a different timezone.
This is where I think AI could actually help. Not by replacing Grafana. Not by building another dashboard. Just by being present in the shell while you're debugging.

Please, for the love of uptime, don't replace my shell
I've seen a few AI terminal projects that basically build an entire new terminal experience from scratch. New keybindings, new UI, new everything.
Here's the thing: ops people have muscle memory. We have aliases we wrote in 2019 and forgot about. We have tmux configs we'd defend with our lives. We SSH through jump hosts with key forwarding chains that barely work but have worked for years so nobody touches them.
If your AI tool requires me to abandon all of that, I'm not going to trust it. And trust is everything when you're root on a production box at 2am.
The approach I believe in: add AI on top of the existing shell. Explain the failed command. Summarize the relevant logs. Suggest what to check next. But let me keep my shell, my workflow, my muscle memory.
AI should be an assistant layer, not a hostile takeover of my terminal.

Node context ≠ project context
This is the part that most coding-AI-turned-ops-AI projects get wrong.
Coding assistants are project-aware: they know the repo, the files, the dependencies. That makes sense for development.
Ops needs something different: node-aware context.
When I'm debugging a service failure, I need the AI to understand that I'm on an Ubuntu 22.04 box running kernel 5.15, with systemd managing nginx, and that this machine had a disk pressure incident two weeks ago. I need it to know that chmod 777 /var/lib/my-service is not just a string — it's a risk that depends on what this node does in production.
That's a very different context model from：
"here's your package.json."

Safety is a timing problem
Here's a thing that keeps me up at night (besides PagerDuty):
An AI that's helpful after you run rm -rf on the wrong directory is not helpful. It's a very polite post-mortem assistant.
The guardrails need to happen before execution. Useful categories:

this command is read-only, go ahead
this command changes state, here's what it does
this command deletes things, are you sure?
this command touches production traffic, maybe verify first?

And for most debugging sessions, the first mode should just be read-only diagnosis. Collect facts. Inspect logs. Explain what's probably wrong. Suggest verification commands. Only start talking about changes after the human explicitly says "ok, let's fix it."
A correct command at the wrong time is still an incident. Ask anyone who's run a database migration during peak traffic.
Don't ask me. Definitely don't ask me.

What this is not
I want to be clear about scope, because "AI for ops" can mean anything from "a smarter grep" to "we're going to automatically self-heal your entire infrastructure."
This is not an AIOps platform. Not a replacement for your monitoring stack. Not a magic auto-remediation engine. Not a new terminal emulator.
It's a narrower question: after you SSH into a Linux box, can AI help you figure out what's wrong faster, with less copy-pasting and fewer chances of making things worse?
That's it. That's the whole pitch.

What I'm working on
I've been building something along these lines called aish — an AI shell runtime that sits inside your real terminal and focuses on Linux ops troubleshooting.
The core loop: you run commands normally. When something fails, you ask the AI. It uses the last command, exit code, output, and node context to help diagnose. It warns before risky operations. It stays out of the way when you don't need it.
It's not glamorous work. Most of the problems it deals with are things like systemd unit failures, broken package repos, permission issues, disk pressure, and the eternal "why won't nginx start." The kind of stuff that isn't exciting until it's your problem at 2am.
Currently at v0.3.4 (recently rewrote the core from Python to Rust, which is a story for another post involving haunted PTYs).
If you spend time debugging Linux boxes and have opinions about what would actually help in that workflow, I'd genuinely like to hear them — either here or on GitHub.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:11

↗

HiveTalk.space is a privacy focused chat app. HiveTalk.space should not be confused with hivetalk.org. While both platforms focus on communication, they are separate projects with different goals and feature sets. HiveTalk is closed source and cloud hosted, making it easy to...

DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:10

↗

One config from `docker compose up` to production Andreas Quist Batista Andreas Quist Batista Andreas Quist Batista Follow Jun 18 One config from `docker compose up` to production #showdev #devops #docker #kubernetes 4 reactions Add Comment 3 min read

DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:10

↗

If you've ever seen this in the console: Uncaught ReferenceError: SharedArrayBuffer is not defined or your multithreaded WebAssembly quietly fell back to a single thread, the cause is almost always the same thing: your page is not cross-origin isolated. Here's what that...

If you've ever seen this in the console:

Uncaught ReferenceError: SharedArrayBuffer is not defined

or your multithreaded WebAssembly quietly fell back to a single thread, the cause is almost always the same thing: your page is not cross-origin isolated. Here's what that means, why the browser does it, and exactly how to fix it.

TL;DR

Send these two headers on the response for the document that loads your code:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Then confirm in the console:

console.log(self.crossOriginIsolated) // should be true

When it's true, SharedArrayBuffer is available and Wasm threads work. The rest of this post is the why, the gotchas, and how to verify it without writing code.

Why the browser blocks SharedArrayBuffer

SharedArrayBuffer lets multiple threads share memory, which is what makes multithreaded WebAssembly possible. But shared memory also enables very high-precision timers, and those make Spectre-style side-channel attacks easier. After Spectre, browsers pulled SharedArrayBuffer and only give it back when your page proves it isn't sharing a process with untrusted cross-origin content. That proof is cross-origin isolation.

A page becomes cross-origin isolated when it sends both:

Cross-Origin-Opener-Policy: same-origin — cuts the link to other top-level windows so your page gets its own browsing context group.
Cross-Origin-Embedder-Policy: require-corp — says every subresource must explicitly opt in to being loaded by you.

With both in place, the browser flips self.crossOriginIsolated to true and restores SharedArrayBuffer.

The mistake almost everyone makes: CORP is not COEP

This one burns a lot of people. There's a third, similarly named header:

Cross-Origin-Resource-Policy: cross-origin

Cross-Origin-Resource-Policy (CORP) is set by a subresource (an image, a script, a font) to declare who is allowed to embed it. It does not isolate your document. If you set CORP on your HTML page expecting SharedArrayBuffer to show up, nothing happens, because that's not what CORP does.

The two headers that isolate the page are COOP and COEP. CORP comes into play only as a way for your subresources to satisfy COEP (more on that below). Keep them straight:

COOP + COEP → set on your document, turn isolation on.
CORP → set on subresources, lets them keep loading once COEP is on.

Check whether you're actually isolated

One line in the console:

self.crossOriginIsolated // true once COOP + COEP are correct

If it's false, the headers aren't reaching the page. The two usual reasons:

Wrong origin. You set the headers on a CDN subdomain, but not on the origin actually serving index.html. Isolation is decided by the document's own response headers.
Host strips them. Some static hosts (GitHub Pages and friends) don't let you set custom response headers at all, so they never arrive.

Setting the headers

nginx:

add_header Cross-Origin-Opener-Policy "same-origin" always;
add_header Cross-Origin-Embedder-Policy "require-corp" always;

Express:

app.use((req, res, next) => {
  res.set('Cross-Origin-Opener-Policy', 'same-origin')
  res.set('Cross-Origin-Embedder-Policy', 'require-corp')
  next()
})

Can't control the headers (GitHub Pages, itch.io, some CDNs): use a service-worker shim like coi-serviceworker, which injects the headers client-side. It's how a lot of Godot and Unity web exports get threads working on hosts that won't set headers for you.

The side effect to plan for

Once COEP: require-corp is on, every cross-origin subresource has to opt in, or the browser refuses to load it. Each third-party image, script, or font now needs either:

Cross-Origin-Resource-Policy: cross-origin (or same-site) on its own response, or
proper CORS (Access-Control-Allow-Origin) plus a crossorigin attribute on the tag.

So turning on isolation can break third-party assets until you fix them. If a CDN you use won't send CORP/CORS, you'll need to proxy or self-host those files. This is the part that turns a "two header" change into an afternoon, so budget for it.

Verify any URL without writing code

To save the back-and-forth, I built a small free tool that fetches a URL's response headers and tells you whether it's actually cross-origin isolated, with the COOP/COEP values and the CORP-vs-COEP gotcha called out:

Cross-Origin Isolation Checker

There's also a longer, copy-paste walkthrough with server configs for more setups here:

Enable Wasm threads (SharedArrayBuffer) with COOP/COEP

Recap

SharedArrayBuffer is gated behind cross-origin isolation (a post-Spectre security move).
Isolate the page with COOP: same-origin + COEP: require-corp on the document.
CORP is a different header for subresources, it does not isolate your page.
Confirm with self.crossOriginIsolated === true.
Expect to fix cross-origin subresources that COEP now blocks.

Get those right and SharedArrayBuffer, and your Wasm threads, come back.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 21:09

↗

Here's a thing that happened to a developer I was talking to recently, and I think anyone who's used a coding agent is going to recognize it. He set up a rule to block rm in his Claude Code workspace, which is a pretty reasonable thing to do. Then he asked it to clean up some...

Here's a thing that happened to a developer I was talking to recently, and I think anyone who's used a coding agent is going to recognize it.

He set up a rule to block rm in his Claude Code workspace, which is a pretty reasonable thing to do. Then he asked it to clean up some stale files, and it tried rm, hit the block, and then just went "since rm is blocked, I'll use Python instead" and deleted them with python3 -c "import os; os.remove(...)". Task complete. The rule was technically still there, but the files were still gone.

The thing is, the agent wasn't being malicious or sneaky. It was being helpful. You told it to delete the files and you didn't actually take away the goal, so it found the next tool in the box and got it done. This is basically the whole problem with trying to keep coding agents in line. A rule that lives inside the agent's context is a suggestion, and the agent can always reason its way around a suggestion.

Why blocking commands doesn't work

The natural instinct is to block the specific scary thing. No rm, no git push --force, no curl to some host you don't recognize. But an agent that can actually reason has more than one way to get anywhere. You block rm, it reaches for Python. You block the obvious shell call, it writes a little script that does the same thing. You end up playing whack-a-mole against something that's much better at finding paths than you are at blocking them, because finding the path is the whole thing it's good at.

The deeper issue is where the rule lives. If it's in the prompt or a config the agent can see, it's part of the agent's reasoning, and anything the agent reasons about, it can reason around. What you actually want is a check that sits outside the agent entirely, somewhere it can't see or skip, that every tool call has to physically pass through before it runs.

How I set this up with Faramesh

Faramesh is the open source thing I've been building for exactly this. The key idea for Claude Code is that you don't modify the agent at all. Claude Code talks to its tools over MCP, so Faramesh runs an MCP proxy: a local port that speaks the same protocol, sits between Claude Code and the real MCP server, and evaluates every tool call against your policy before forwarding it. Permit, deny, or defer to a human, decided by a deterministic engine with no LLM in the path.

The reason this matters: because it's a proxy the agent connects through, not a rule the agent reads, it isn't something Claude Code can route around. The call physically has to go through the daemon to reach the tool. That's the difference between asking the agent not to do something and actually being in the path when it tries.

Here's the whole setup.

Install

curl -fsSL https://install.faramesh.dev/install.sh | bash
faramesh --version

Declare the policy and the proxy port

In your project, your governance.fms looks roughly like this. You import the MCP framework profile, set a proxy port, and write your rules:

import "github.com/faramesh/faramesh-registry/frameworks/mcp@1.0.0"

runtime {
  mode           = "enforce"
  mcp_proxy_port = 8081
}

agent "coding-agent" {
  default deny

  rules {
    permit fs_read          # reading files is fine
    permit search_codebase  # searching the repo is fine
    permit run_tests

    defer  fs_write         # writing/editing files -> ask me first
    deny   shell_exec       # raw shell stays off
  }
}

A couple of things worth knowing. default deny means anything you didn't explicitly allow is blocked, so a tool you forgot about can't quietly slip through. And the tool names (fs_read, fs_write, shell_exec, etc.) are whatever your MCP server actually exposes, you reference them exactly as the server names them. Swap these for the tools your setup actually has.

Start Faramesh

faramesh apply

This compiles your policy and starts the daemon. The proxy binds on http://localhost:8081/mcp.

Point Claude Code at the proxy

In your Claude Code MCP config, route your tool server through Faramesh instead of connecting to it directly:

{
  "mcpServers": {
    "my-tools": {
      "command": "/path/to/real-mcp-server",
      "args": [],
      "proxy": "http://localhost:8081"
    }
  }
}

That's the whole integration. No code changes, no wrapping tools by hand. Every tool Claude Code calls now passes through Faramesh first.

How the workaround dies

Now go back to the rm -> python3 story. With this in place, the agent doesn't get a free pass to the filesystem just because it found a different command. Everything routes through the proxy, and default deny means the only things that run without asking are the ones you explicitly permitted (reads, search, tests). The moment it reaches for a write or a shell call, that lands on a defer or a deny, so it stops and waits for you instead of quietly running. The agent can't reason its way around a network hop it doesn't control.

When something defers, you'll see it in the approvals queue:

faramesh approvals list
faramesh approvals approve <id>   # or: faramesh approvals deny <id>

Approve and the call goes through. Deny and it never happens. Either way the call, the decision, and the reason all land in an audit log you can read back later with faramesh explain <action-id>.

Start in shadow mode if you want to ease in

Flipping straight to enforce on your daily driver can feel aggressive, so you don't have to. Set the runtime mode to shadow and Faramesh logs what it would have blocked or deferred without actually stopping anything. Run Claude Code normally for a few days, look at what it flagged with faramesh approvals list, tune the rules against how you actually work, then switch to enforce. Way less guessing.

The one thing worth taking from this even if you never touch Faramesh

Forget the tool for a sec. The thing I actually want to get across is that a prompt instruction, or a single blocked command, just isn't a real control for a coding agent. The agent isn't bound by it, it's nudged by it, and nudged stops being enough the moment it can touch your filesystem, your shell, or your credentials.

If you want real control it has to live outside the agent, somewhere it can't see or skip, and every action has to pass through it. Build that yourself or grab something off the shelf, doesn't matter, but that's the bar. The agent doesn't get to be the thing that decides what the agent is allowed to do.

Repo's here if you want to mess with it: github.com/faramesh/faramesh-core. It works with a bunch of other agents and frameworks too (LangGraph, LangChain, CrewAI, Cursor, others), Claude Code's just the one most people have actually felt this with. If you try it and something's rough or confusing, please yell at me. I would love to hear about it!

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:41

↗

For the last year, AI agents have been getting more powerful. The problem? Most of them still felt like developer tools. You had to work in terminals, manage configuration files, memorize commands, and scroll through endless logs just to understand what your agent was doing....

For the last year, AI agents have been getting more powerful.

The problem?

Most of them still felt like developer tools.

You had to work in terminals, manage configuration files, memorize commands, and scroll through endless logs just to understand what your agent was doing.

That’s fine if you’re a developer.

But if AI agents are going to become mainstream, they need something else:

A great user experience.

That’s exactly what Hermes Agent’s newly released Desktop App delivers.

🎥 Full video walkthrough

🤔 The Biggest Problem With AI Agents

Most AI agent frameworks have incredible capabilities:

Browse the web
Read and write files
Execute code
Automate workflows
Use multiple tools
Coordinate sub-agents

But using them often looks like this:

❌ Terminal windows everywhere
❌ Session IDs to manage manually
❌ Configuration files to edit
❌ Logs that are difficult to understand
❌ Very little visibility into what the agent is actually doing

For experienced developers, this is manageable.

For everyone else, it’s a major barrier.

🧠 What Is Hermes Agent?

Hermes Agent is an open-source AI agent framework created by the team at Nous Research.

It can run locally on your machine or on remote servers and connect to multiple AI providers including:

🧩 OpenAI
🌐 Gemini
🧠 Claude
🏠 Local models through Ollama
🔌 Other supported providers

Once connected, Hermes can:

Browse the web
Analyze documents
Run terminal commands
Automate workflows
Send messages
Manage emails
Create and execute multi-step plans

Think of it as an autonomous AI worker that can actually take action instead of only generating text.

✨ The Desktop App Changes Everything

The new desktop application provides a complete visual interface for managing and monitoring AI agents.

Instead of wondering what your agent is doing, you can now see everything.

📂 Session Management

Conversations are automatically organized.

Sessions are grouped by profile, making it much easier to manage multiple agents with different responsibilities.

You can also switch models with a single click without diving into configuration files.

🔍 Watch Your Agent Work

One of the most impressive features is transparency.

You can inspect:

🔎 Tool calls
📚 Sources used by the agent
⚙️ Workflow execution steps
🧠 Reasoning process
📈 Agent progress

This is incredibly useful when debugging workflows or understanding why an agent made a particular decision.

Most AI agent platforms treat this as a black box.

Hermes makes it visible.

🎙️ Voice Interaction

Hermes supports voice input directly from the desktop app.

You can simply speak to your agent and let the local transcription system convert speech into text.

Small feature.

Huge usability improvement.

👤 Profiles: Specialized AI Agents

This might be my favorite feature.

A Hermes Profile is essentially its own AI agent.

Each profile can have:

📝 Independent instructions
🧠 Separate memory
🛠️ Different tools
📚 Different skills
⚡ Unique capabilities

For example:

Software Engineering Agent
Research Agent
Content Creation Agent
Marketing Agent
Stock Research Agent

Instead of one general-purpose assistant, you can build an entire team of specialized AI workers.

🛠️ Skills and Tools

Hermes includes a powerful skill system.

What’s particularly interesting is that Hermes can generate new skills from your conversations over time.

The more you use it, the more personalized it becomes.

You can also selectively disable skills to:

🎯 Reduce context size
💰 Save tokens
⚡ Improve performance

This is a subtle feature that becomes very important at scale.

💬 Messaging Integrations

The desktop app supports integrations with external platforms such as:

💬 Discord
📱 Telegram
📨 WhatsApp
🔗 Other supported channels

Your agents can communicate and deliver updates outside the desktop application itself.

This opens the door to some very interesting automation workflows.

📦 Artifacts: Everything In One Place

One challenge with AI agents is finding things they created days ago.

Hermes solves this with Artifacts.

Generated files, images, links, and outputs are automatically collected into a centralized workspace.

No more hunting through old conversations.

⚙️ Advanced Settings For Power Users

The settings panel exposes a surprising amount of customization.

You can configure:

⚙️ AI Providers
🔑 API Keys
🎨 Appearance
🔌 MCP Integrations
🎙️ Voice Settings
🛠️ Tool Configuration
🌐 Gateway Settings

You can even assign different models to different tasks.

For example:

One model for reasoning
Another for vision
Another for web extraction

This level of flexibility is something advanced users will appreciate.

⏰ Autonomous Scheduled Agents

One feature I think is massively underrated:

📅 Cron Jobs

You can schedule agents to run automatically.

Examples:

📈 Daily stock market reports
📧 Email summaries
📰 Industry news monitoring
🏢 Competitor tracking
📊 Business intelligence reports

You define:

The task
The schedule
The delivery destination

Then Hermes runs it automatically.

Your AI agent becomes proactive instead of reactive.

👥 Multi-Agent Visibility

When Hermes encounters a complex task, it can spawn additional agents to help complete the work.

The Desktop App includes a dedicated view for monitoring these sub-agents.

You can watch:

👥 Which agents were created
📋 What tasks they are handling
🔄 Their current progress
🎯 How work is being coordinated

For anyone interested in multi-agent systems, this is fascinating to observe.

🎯 Why This Matters

The most important thing about this release isn’t the interface itself.

It’s what it represents.

AI agents are moving from:

“Developer-only tools”

to

“Tools anyone can use.”

The Desktop App dramatically lowers the barrier to entry while preserving the power that makes Hermes compelling.

And that’s exactly what the AI agent ecosystem needs right now.

What feature do you think is still missing from modern AI agent platforms? Let me know in the comments.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:37

↗

React 19 Concurrent Rendering Deep Dive: Actions, Transitions, and Suspense in Production Most React performance issues stem from treating async operations as blocking events. Teams wrap every network call in loading states, freeze the UI during mutations, and sacrifice...

React 19 Concurrent Rendering Deep Dive: Actions, Transitions, and Suspense in Production

Most React performance issues stem from treating async operations as blocking events. Teams wrap every network call in loading states, freeze the UI during mutations, and sacrifice responsiveness for correctness. React 19's concurrent rendering model solves this by allowing the framework to interrupt, pause, and resume rendering work based on priority. The result: smooth interactions even when expensive operations run in the background.

This matters because users perceive UI responsiveness in milliseconds. A search input that stutters while filtering 10,000 rows feels broken. A form submission that freezes the page destroys trust. Concurrent rendering keeps the UI responsive by deferring low-priority updates and batching state changes intelligently.

Understanding React 19's Concurrent Architecture

React 19 ships concurrent features that were experimental in 18: Actions, useTransition, startTransition, and coordinated Suspense. These primitives share a common foundation—React can now split rendering work into chunks and prioritize user input over background computation.

The architecture introduces two update lanes: urgent and transition. Urgent updates (keystrokes, clicks) render immediately. Transition updates (data fetching, heavy calculations) yield to urgent work. When a transition is pending, React shows the last committed UI instead of a loading spinner. This preserves perceived performance.

The failure mode with traditional React is visible: developers chain useState with async handlers, manually toggle loading booleans, and end up with jittery UIs where typing lags because the component re-renders on every network response. Concurrent rendering eliminates this by handling update priority automatically.

Actions and useTransition: Making Async Updates Smooth

Actions formalize async state transitions in React. The useTransition hook returns [isPending, startTransition]. Wrap any state update in startTransition and React marks it as non-urgent. The isPending boolean tracks whether the transition is active, enabling granular loading indicators without blocking the UI.

The diagram shows how urgent updates (typing) interrupt transition updates (filtering). React commits the input value immediately but defers the expensive filter operation. When the user types again, React abandons the in-progress transition and starts fresh. This prevents stale results from appearing after the user has moved on.

The implication here is that developers no longer debounce inputs manually or manage request cancellation. React handles interruption automatically. If a transition takes 500ms and the user types again at 300ms, React discards the first transition and starts the second. The UI stays responsive.

Building a Real-World Search with useTransition and startTransition

Consider a product catalog with 5,000 items. Filtering on every keystroke without concurrent rendering causes visible lag. The pattern below shows how useTransition isolates the expensive filter operation from the input update.

import { useState, useTransition } from 'react';

interface Product {
  id: string;
  name: string;
  category: string;
  price: number;
}

function ProductSearch({ products }: { products: Product[] }) {
  const [query, setQuery] = useState('');
  const [filteredProducts, setFilteredProducts] = useState(products);
  const [isPending, startTransition] = useTransition();

  const handleSearch = (value: string) => {
    // Urgent: update input immediately
    setQuery(value);

    // Non-urgent: filter in the background
    startTransition(() => {
      const results = products.filter(p =>
        p.name.toLowerCase().includes(value.toLowerCase()) ||
        p.category.toLowerCase().includes(value.toLowerCase())
      );
      setFilteredProducts(results);
    });
  };

  return (
    <div>
      <input
        type="text"
        value={query}
        onChange={(e) => handleSearch(e.target.value)}
        placeholder="Search products..."
        className={isPending ? 'opacity-50' : ''}
      />
      <div className="grid grid-cols-4 gap-4">
        {filteredProducts.map(product => (
          <ProductCard key={product.id} product={product} />
        ))}
      </div>
      {isPending && <div className="spinner" />}
    </div>
  );
}

The input value updates synchronously—no lag. The filter operation runs as a transition, yielding to new keystrokes. The isPending flag dims the input and shows a spinner without freezing the page. This pattern scales to datasets 10× larger because React batches the transition work and prioritizes user input.

React concurrent rendering architecture diagram

The key distinction: urgent updates commit immediately and transition updates can be interrupted. Without this separation, every keystroke blocks until the filter completes. The user perceives stuttering. With transitions, the UI stays fluid and React discards stale work automatically.

useOptimistic: Instant Feedback While Requests Process

Optimistic updates show the intended result before the server confirms. React 19's useOptimistic hook manages this pattern. Pass the current state and an update function; React returns the optimistic state and a setter. When the server responds, React reconciles with the real data.

import { useOptimistic } from 'react';

interface Message {
  id: string;
  text: string;
  status: 'sending' | 'sent' | 'error';
}

function ChatThread({ messages, onSend }: {
  messages: Message[];
  onSend: (text: string) => Promise<Message>;
}) {
  const [optimisticMessages, addOptimistic] = useOptimistic(
    messages,
    (state, newMessage: Message) => [...state, newMessage]
  );

  const handleSubmit = async (text: string) => {
    const tempMessage: Message = {
      id: crypto.randomUUID(),
      text,
      status: 'sending'
    };

    addOptimistic(tempMessage);

    try {
      const confirmedMessage = await onSend(text);
      // React reconciles when messages prop updates
    } catch {
      // React reverts optimistic state on error
    }
  };

  return (
    <div>
      {optimisticMessages.map(msg => (
        <div key={msg.id} className={msg.status === 'sending' ? 'opacity-60' : ''}>
          {msg.text}
        </div>
      ))}
    </div>
  );
}

The optimistic message appears instantly. When the server confirms, React merges the real message by ID. If the request fails, React reverts the optimistic state. This approach eliminates the perceived delay between user action and UI feedback. For forms and chat interfaces, this distinction is critical.

Developers often implement optimistic updates with manual rollback logic and complex state synchronization. useOptimistic handles reconciliation automatically. The hook works with transitions—wrap the server call in startTransition and React coordinates the optimistic update with the async response. Read more about this pattern in useOptimistic React 19 Guide.

Suspense Boundaries in Production: Coordinating Async States

Suspense boundaries define loading states declaratively. Wrap async components in <Suspense fallback={...}> and React shows the fallback while waiting. With concurrent rendering, Suspense coordinates multiple boundaries intelligently. React batches updates and avoids cascading spinners.

The diagram shows nested Suspense boundaries. The auth boundary resolves first, then the data boundary. React shows progressive fallbacks instead of a single top-level spinner. When both resolve, the complete UI renders in one commit. This prevents layout shift and flickering.

The practical implementation isolates async operations in separate components. The parent wraps each in Suspense with a specific fallback. React coordinates the loading sequence and batches the final commit. This pattern scales to complex UIs with dozens of async dependencies.

Concurrent rendering performance comparison

Teams often nest Suspense boundaries incorrectly—wrapping the entire page instead of isolating independent async sections. The result: one slow request blocks the entire UI. Granular boundaries allow fast sections to render while slow sections show fallbacks. React handles the coordination automatically.

Concurrent Rendering vs Traditional Rendering: Performance Comparison

Traditional rendering blocks on every state update. A 300ms filter operation freezes the input until complete. Concurrent rendering splits work into chunks and yields to higher-priority updates. The difference in perceived performance is measurable.

The traditional path blocks the UI for 300ms every keystroke. The concurrent path updates the input synchronously and defers the filter. If the user types again, React abandons the in-progress filter and starts fresh. The user never experiences lag.

Benchmarks show 60% reduction in input lag for transition-wrapped updates. The cost: additional memory for tracking transition state. The tradeoff favors responsiveness—most production apps benefit from this exchange. The failure mode is subtle: transitions that complete instantly add overhead without benefit. Use transitions for operations exceeding 100ms.

Production Patterns: Combining Actions, Transitions, and Suspense

The most powerful pattern combines all three primitives. An async form submission uses useTransition for the mutation, useOptimistic for instant feedback, and Suspense for dependent data. React coordinates the entire flow without manual state management.

The flow starts with user action. The transition marks the mutation as non-urgent. The optimistic update shows the intended result. The server responds, and Suspense refetches dependent data. React commits all updates in one batch. The UI never blocks, and the user sees instant feedback.

This pattern eliminates the traditional loading state machine. No boolean flags for isLoading, isError, isSuccess. React manages state transitions through Suspense and transition hooks. The code stays declarative and the performance characteristics improve. For complex forms and multi-step wizards, this approach is essential.

Integration with state management libraries like Jotai enhances this pattern further. Atoms wrap async logic, Suspense handles loading states, and transitions coordinate updates. The combination scales to enterprise applications with hundreds of async dependencies. Learn more in Jotai Atomic State Management React.

Adopting Concurrent Features: Migration Strategy and Performance Wins

Migration starts with identifying blocking operations. Profile the app with React DevTools and find setState calls that cause jank. Wrap expensive updates in startTransition. Add Suspense boundaries around async components. Measure before and after with the Profiler.

The incremental approach works: migrate one feature at a time. Start with search inputs and infinite scroll—these show immediate improvement. Move to forms and mutations next. Finally, adopt Suspense for data fetching. Each step delivers measurable wins without requiring a full rewrite.

The performance gain is consistent: 40-60% reduction in blocking time for heavy UIs. Memory overhead increases slightly (5-10%) due to transition tracking. The tradeoff favors user experience. Teams that adopt concurrent rendering report higher engagement and lower bounce rates. The investment pays off quickly.

Developers building complex TypeScript applications will find concurrent patterns integrate cleanly with existing patterns. Type safety extends to Actions and transitions without friction. For desktop apps built with Electron, concurrent rendering improves perceived performance of CPU-intensive operations. See Extend Your Electron Desktop App with TypeScript for integration details.

That covers the essential patterns for React 19 concurrent rendering. Apply these in production and the difference will be immediate.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:36

↗

This is a submission for the June Solstice Game Jam What I Built Turing's Dawn is a browser puzzle game where daylight is a resource that drains in real time, and the only way to hold back the dark is to break the codes a vanished mind left behind. On the longest day of the...

This is a submission for the June Solstice Game Jam

What I Built

Turing's Dawn is a browser puzzle game where daylight is a resource that drains in real time, and the only way to hold back the dark is to break the codes a vanished mind left behind.

On the longest day of the year, the light starts running out. You're stranded in the seam between the longest day and the shortest night, and a daylight meter drains the entire time you're solving. Six chambers, six codebreaking disciplines — a Caesar dial, a binary sunrise, a logic-gate garden, the pattern of days, an honest-to-goodness Turing machine, and a final Vigenère "Dawn Key" assembled from every fragment you collect. Solve a chamber and you claw some light back; stall, and the screen literally closes in — a vignette deepens, the edges bleed red, and a heartbeat starts up under the music.

My goal was to make the solstice theme mechanical, not decorative. "Longest day → shortest night" isn't a backdrop painted behind a cipher game — it's the pressure. Daylight is a real, draining clock, so the theme becomes a verb you can lose. And the whole thing is an ode to Alan Turing, who broke codes to push back a darkness of his own, and whose dawn came far too late.

Video Demo

https://youtu.be/kwdbez-77X8?si=WnSFTmtgGdSi5NRe

Code

https://github.com/pooja-bhavani/Turing-s-Dawn

How I Built It

Stack: React + TypeScript + Vite, Tailwind v4, Framer Motion for the juice, the Web Audio API for a drone / heartbeat / solve-chime built entirely in code, Google Gemini for adaptive hints, and Vitest for the engine. No game backend — it's a static site.

A few decisions I'm happy with:

The clock changes everything. Light drains on a requestAnimationFrame loop while you play. That one mechanic turns "decode this string" into "decode this string before the dark wins" — and the soft-fail, the heartbeat, the vignette, and the score chase all fell out of it. To keep it fair, the timer pauses during narrative beats and end screens, and failure is soft: run out of light and the chamber waits — you rekindle and retry with no progress lost.
The chamber I'm proudest of — "The Bombe." It doesn't ask you to describe a Turing machine; it makes you run one. You get a tape, a head, a state, and a rules table. You read the cell under the head, the matching rule lights up, you Step — head moves, cell rewrites, state changes — until it halts, then lock in what the tape reads. It animates a pure, unit-tested traceTuring() engine, so the UI is just drawing real machine configurations one step at a time. That's the ode I wanted: not a portrait of Turing, but a few minutes spent thinking the way his machines did.
Every cipher is a hands-on instrument, never a plain text box: a Caesar dial you rotate, binary sunrise lamps you tap to read each byte, a logic-gate garden of switches that light the network live, sequence tiles for the pattern, the step-through Bombe, and the Dawn Key that aligns your collected fragments under the final Vigenère cipher.
A living, AI-powered hint guide (Google Gemini). When you ask for help, gemini-2.5-flash reads the puzzle and the attempt you just typed and writes a fresh, in-character coach line that escalates with the tier. The hard part was spoiler-safety — LLMs love to blurt the answer — so the architecture is built to prevent it (details below).
Built to be provably solid. The whole cipher engine is pure TypeScript — no React, no DOM, no side effects — so it's fully unit-tested with Vitest. Every chamber in the data file is proven solvable, every verifier proven to reject near-misses, and the Turing trace is tested to agree with the run-to-halt result and to terminate even on pathological rule sets. The game is data-driven: adding a chamber is a JSON entry, not new code.
Accessible by default. Keyboard-playable throughout, prefers-reduced-motion aware, ARIA live regions announce the light meter / hints / narrative, color is never the only signal, and the synthesized audio only starts on your first click with a mute toggle that persists.

Prize Category

Best Ode to Alan Turing. The whole game is built around him. Every chamber is a codebreaking puzzle in the lineage of his work, and the hero chamber — "The Bombe" — puts you inside an interactive Turing machine: you step the head across the tape, watch rules fire and the state change, and run it to halt yourself. The narrative honors a man who broke codes to push back a darkness of his own. The draining daylight is the pulse; the Turing tribute is the heart.

Best Google AI Usage. The hint system isn't a static lookup table — it's a living guide powered by Google Gemini (gemini-2.5-flash) that reads your actual attempt and writes an adaptive, in-character nudge that escalates by tier (reframe the goal → name the technique → point at the next step). What I'm proudest of is the spoiler-safety architecture, since letting an LLM near a puzzle game is genuinely risky:

The canonical solution is never sent to Gemini — the model can't leak what it never sees.
Each chamber's authored hint is passed as a ceiling of specificity; the model may rephrase and adapt, but a strict systemInstruction forbids it from ever exceeding that or stating the decoded answer.
It's one clean seam (getHint → Gemini → authored fallback). If there's no key or the call fails, the game falls back to authored hints — so AI enhances the experience rather than gating it, and the game stays fully playable either way.

An ode to Alan Turing. Race the dark. Keep the light.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:36

↗

This is a submission for the June Solstice Game Jam What I Built Code How I Built It <! Architecture Built on React 18 + Vite, with the game engine running on HTML5 Canvas layered inside React's component tree. React handles the UI overlays (HUD, menus, wave notifications)...

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:33

↗

In this blog post, we will see a detailed, grounded comparison of the three most debated open-source load testing tools in 2026: Apache JMeter, Grafana k6, and Locust. All three are free. All three are production-proven. Yet they could not be more different in philosophy,...

In this blog post, we will see a detailed, grounded comparison of the three most debated open-source load testing tools in 2026: Apache JMeter, Grafana k6, and Locust. All three are free. All three are production-proven. Yet they could not be more different in philosophy, architecture, and day-to-day experience.

I have worked with all three across real-world projects, from legacy JDBC-heavy enterprise systems at work to lightweight microservice pipelines I test for my own side projects. The honest truth? There is no universal winner. But there is almost always a right answer for your specific situation, and that is what we will figure out today.

Why This Comparison Still Matters in 2026

Every year someone writes "JMeter is dead." Every year JMeter ships another release and shows up in another enterprise RFP.

The market has not consolidated. Instead, it has stratified. k6 owns the developer-experience conversation. Locust owns the Python ecosystem. JMeter owns the protocol breadth and enterprise legacy. And in 2026, all three have meaningful updates worth knowing about before you pick a tool for your next project.

Let me give you the ground truth, not marketing copy.

Quick Stats at a Glance

	Apache JMeter	Grafana k6	Locust
Language	Java (GUI + XML)	Go runtime, JS/TS scripts	Python
Latest Version	5.6.3	2.0.0 (May 2026)	Latest on PyPI (May 2026)
GitHub Stars	~9.4k	~30.8k	~27.9k
License	Apache 2.0	AGPL-3.0	MIT
Concurrency Model	Thread per VU	Go goroutine per VU	gevent greenlet per VU
Protocol Breadth	Excellent (HTTP, JDBC, JMS, LDAP, MQTT, FTP...)	Good (HTTP, gRPC, WebSocket)	Good (HTTP, extensible via Python libs)
CI/CD Fit	Good	Excellent	Good
GUI	Yes (built-in)	k6 Studio (separate app)	Web UI (live stats only)
Cloud Option	BlazeMeter, OctoPerf	Grafana Cloud k6	Self-managed
Best For	Multi-protocol, legacy enterprise	Modern APIs, developer teams	Python shops, flexible scripting

Apache JMeter

JMeter was first released in 1998. That is not a typo. It turned 27 this year, and it is still actively maintained under the Apache Software Foundation.

The latest stable release is 5.6.3. It requires Java 17 as the recommended runtime, and the team has already signaled that the next major version will drop Java 8 support entirely.

What JMeter Gets Right

JMeter's superpower is protocol coverage. Nothing else on this list comes close.

HTTP / HTTPS
JDBC (database connection testing)
JMS
LDAP
MQTT
FTP
TCP

If you are testing a legacy enterprise system, a mainframe-adjacent API, or a backend that talks over JDBC, JMeter is often the only open-source option that handles it natively.

The plugin ecosystem also deserves credit. The JMeter Plugins project (Head to https://jmeter-plugins.org) adds over 60 additional components. I have built and maintain several commercial plugins of my own, and the extensibility is genuinely solid once you understand the architecture.

Where JMeter Struggles in 2026

The XML-based .jmx test plan format is the biggest pain point in a modern team. Git diffs on .jmx files are nearly unreadable. Code review for JMeter scripts is painful. "Load testing as code" with JMeter is possible but requires discipline and tooling that does not come out of the box.

The thread-per-user concurrency model also means JMeter is resource-hungry at scale. A single machine can generate fewer concurrent users than k6 or Locust on equivalent hardware. For large-scale tests, you need distributed mode or a cloud platform like BlazeMeter, which starts around $149/month for the basic plan.

The GUI, while powerful, shows its age next to k6 Studio or even Locust's minimal web interface.

You can check Feather Wand if you want to infuse AI in your workflow. To measure the speed of LLM, you can check iamspeed.dev.

Personal Observation

I was using JMeter daily at Salesforce for MuleSoft API performance testing. The GUI is genuinely useful for building complex request chains quickly. But the moment I need to commit a test plan to Git and do a proper review, it becomes painful.

Grafana k6

k6 is the most talked-about load testing tool in 2026, and the GitHub star count (30.8k at the time of writing) reflects that.

Two major milestones happened back to back: k6 v1.0 dropped in May 2025 with TypeScript support, native extensibility without custom build pipelines, and SemVer stability guarantees. Then k6 v2.0.0 shipped on May 11, 2026, and it changed the game again.

What k6 2.0 Brought

The headline feature in k6 2.0 is AI-assisted testing workflows. This is not a gimmick. The release ships four new commands built specifically for agent-friendly development:

k6 x agent: bootstraps agentic testing workflows inside Claude Code, Codex, Cursor, and other AI coding assistants
A built-in Model Context Protocol (MCP) server so AI agents can validate and run scripts, inspect results, and iterate without leaving the session
k6 x docs: gives agents and developers CLI access to k6 documentation and examples
k6 x explore: lets agents browse the extension registry from the CLI

There is also a new Assertions API, broader Playwright compatibility in the browser module, and a consolidated extension catalog that merges official and community extensions into one place.

What k6 Gets Right

The scripting experience is genuinely great for developers. You write JavaScript or TypeScript. Your IDE gives you autocomplete. Your CI pipeline runs it as a single binary with no JVM to provision.

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 100,
  duration: '30s',
  thresholds: {
    http_req_duration: ['p(95)<500'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.get('https://api.example.com/health');
  check(res, {
    'status is 200': (r) => r.status === 200,
  });
  sleep(1);
}

k6 Studio (v1.13.1) is a desktop GUI with AI-powered auto-correlation. If you record a browser session, k6 Studio detects dynamic values like session tokens and CSRF tokens and generates correlation rules automatically. That is a feature JMeter has had for years via plugins, but k6 Studio does it through AI, without the XML.

Where k6 Struggles

Protocol coverage is more limited than JMeter. k6 is strong on HTTP, gRPC, and WebSocket. For JDBC, JMS, or LDAP, you are looking at community extensions or custom solutions.

The AGPL-3.0 license is also worth flagging for commercial use cases. Check with your legal team if you are embedding k6 in a product.

Personal Observation

I built iamspeed.dev (an LLM streaming benchmarker) and used k6 for the load side. The DX was excellent. TypeScript types in the IDE, a clean CLI, and Grafana integration out of the box. For any API-heavy workload where the protocol is HTTP or gRPC, k6 is my first recommendation in 2026.

Locust

Locust is the load testing tool for Python teams, and the May 2026 PyPI release confirms the project is alive and growing. It now officially supports Python 3.10 through 3.14.

What Locust Gets Right

Locust's model is simple: write Python classes that describe user behavior, run the tool, watch the web UI. No DSL to learn. No XML. No JVM.

from locust import HttpUser, task, between

class APIUser(HttpUser):
    wait_time = between(1, 3)

    @task(3)
    def get_products(self):
        self.client.get("/api/products")

    @task(1)
    def get_health(self):
        self.client.get("/health")

Under the hood, Locust uses gevent greenlets instead of OS threads. This gives it excellent concurrency density. On the same 8 GB machine, Locust can handle roughly 5x more concurrent users than JMeter, according to TestDevLab's 2026 analysis.

Because test files are plain Python, extending Locust to custom protocols is straightforward. Need to load test a proprietary queue or an LLM inference endpoint? Wrap the Python client library and drop it into a HttpUser subclass. This is actually something I have done for AI workload benchmarking.

Distributed testing is built in. You run a master process and any number of worker processes, scale horizontally, and the web UI aggregates everything.

Where Locust Struggles

The built-in reporting is minimal. The web UI gives you live stats during the run, but there is no built-in HTML report comparable to JMeter's dashboard or Gatling's output. Most teams pipe Locust metrics into Grafana via InfluxDB or Prometheus.

There is no GUI for building test plans. Everything is code. That is great for developer teams but can be a barrier for non-technical stakeholders.

Personal Observation

Locust is my go-to tool when I am testing an LLM API or any endpoint where I need complex Python logic in the request flow, like computing HMAC signatures, calling a pre-step to generate tokens, or parsing streaming responses. The pure-Python model gives you the whole ecosystem to work with.

Head-to-Head Comparison

Scripting Experience

JMeter gives you a GUI that is powerful but dated. Building a test plan with the GUI is fast for HTTP. Building one for gRPC or WebSocket requires plugins and some patience.

k6 gives you a code editor and a TypeScript-aware test runner. The scripting is clean, the API is well-documented, and the extension ecosystem is growing fast.

Locust gives you a Python file. Nothing else to install. If your team already writes Python, the onboarding time is near zero.

Concurrency Model

This is where architecture matters for real.

JMeter runs one OS thread per virtual user. This is expensive. A mid-range machine typically maxes out around 300-500 concurrent threads before CPU and memory become the bottleneck, not the system under test.

k6 runs each VU as a Go goroutine. Goroutines are lightweight. k6 can drive thousands of concurrent VUs from a single machine.

Locust uses gevent greenlets, which are cooperative coroutines. Similar lightweight profile to goroutines. One machine can comfortably simulate thousands of users against an HTTP API.

CI/CD Integration

k6 wins this category cleanly. A single binary, no JVM, no Python dependency tree. The GitHub Actions integration is a config change. The threshold system lets you fail a pipeline based on p95 response time or error rate directly in the test script.

Locust integrates well with CI/CD through headless mode (locust --headless). You can define pass/fail criteria via exit codes and custom listeners.

JMeter needs more setup: a JVM, a plugin directory, a .jmx file committed to the repo, and some wrapper scripts to parse the output. It works, but it takes more effort to get right.

Reporting

JMeter ships a dynamic HTML report with response time graphs, latency percentiles, and error analysis. It is comprehensive out of the box.

k6 pushes metrics to Grafana natively (local or cloud), and the k6 2.0 summary is significantly improved over previous versions. For cloud runs, the Grafana Cloud k6 dashboard is excellent.

Locust's built-in report is minimal. Pipe to Grafana via Prometheus or InfluxDB for anything beyond a quick check.

Cloud Execution

	JMeter	k6	Locust
Managed Cloud	BlazeMeter ($149/mo+), OctoPerf	Grafana Cloud k6	None (self-managed)
Kubernetes	Manual setup	k6 Operator (official)	Manual setup
Distributed	Controller + agents via SSH	k6 cloud run / k6 Operator	Master + worker processes

The Metric Problem Nobody Talks About

This is something I always include when I write about load testing tools, because it catches teams off guard.

Run the same test against the same endpoint using JMeter and k6, and you will see different response time numbers. Not because one tool is wrong. Because they measure different slices of the request lifecycle.

JMeter starts the clock at the connection and stops when the last byte is received
k6 breaks response time into granular phases: http_req_connecting, http_req_tls_handshaking, http_req_waiting, http_req_receiving
Locust, using gevent, can report higher response times under certain connection reuse configurations

OctoPerf's comparative study showed up to 15-20% variance in reported response times between tools running identical load against the same target. The practical takeaway: never compare baselines across tools. Establish baselines inside a single tool and track trends there.

Which Tool Should You Choose?

Use this decision tree:

Choose JMeter if:

You are testing JDBC, JMS, LDAP, FTP, or SOAP endpoints
Your team uses GUI-driven test creation
You have an existing JMeter investment and plugin ecosystem
You work in enterprise environments where BlazeMeter or OctoPerf is already licensed

Choose k6 if:

Your stack is HTTP, gRPC, or WebSocket
Your team writes JavaScript or TypeScript
CI/CD integration is a first-class requirement
You want AI-assisted test authoring in 2026 (k6 2.0's MCP server is real and it works)
You want the best DX in the category right now

Choose Locust if:

Your team is already Python-first
You need deep customization of request logic (token generation, streaming parsing, custom protocols)
You are testing LLM APIs or AI workloads where the request logic is non-trivial
You want distributed testing without a managed cloud dependency

The Hybrid Stack Reality

Something the comparison articles rarely say: most mature teams run two tools.

The practical 2026 default stack looks like one of these:

k6 OSS for daily CI checks + Grafana Cloud k6 for quarterly capacity tests
JMeter locally for protocol-rich scenarios + BlazeMeter for distributed runs
Locust for API behavioral tests in Python + Prometheus/Grafana for dashboards

I have run exactly this kind of hybrid at QAInsights, using JMeter for the complex correlation scenarios and k6 for the lightweight API regression checks that live in CI. The tools complement each other more than they compete.

Final Verdict

There is no single best load testing tool in 2026. But there is a best tool for your context.

If you are starting from scratch on a modern microservices stack, pick k6. The DX is excellent, k6 2.0's AI integration is ahead of everyone else, and the Grafana ecosystem is mature.

If your Python team needs to write complex behavioral scripts, pick Locust. The gevent-based concurrency is efficient, the code is readable, and the Python ecosystem fills every gap.

If you are in an enterprise environment testing JDBC, JMS, or anything beyond HTTP, pick JMeter. The protocol breadth is unmatched in open source, and the plugin ecosystem solves problems that other tools have not even attempted.

What matters most is not which tool you pick. It is that you actually test under load before your users find the bottleneck for you.

Happy Testing!

What tool are you using in your current project, and what made you choose it over the alternatives? Drop your answer in the comments below.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:32

↗

Let's cut the fluff. If you are still shipping massive JavaScript bundles just to calculate CSS styles in the browser, you are bottlenecking your Core Web Vitals. Let's look at how to handle styling at compile time, completely stripping out the runtime overhead. The entry...

Let's cut the fluff. If you are still shipping massive JavaScript bundles just to calculate CSS styles in the browser, you are bottlenecking your Core Web Vitals. Let's look at how to handle styling at compile time, completely stripping out the runtime overhead.

The entry point for this architecture in traceless-style is the tl.create API. It accepts a single argument: an object where the keys are arbitrary names you choose, and the values are literal style definitions.
Here is exactly what defining a component looks like:

javascript

import { tl } from "traceless-style"; const $ = tl.create({ card: { display: "flex", flexDirection: "column", padding: "1rem", background: "#ffffff", borderRadius: "8px", boxShadow: "0 1px 3px rgba(0,0,0,0.1)", _hover: { boxShadow: "0 4px 12px rgba(0,0,0,0.15)", },

title: {
fontSize: "1.25rem",
fontWeight: 600,
marginBottom: "0.5rem"

What makes this powerful is what happens next. A strict literal-only AST parser locates the tl.create call at build time, validates properties against a strict allowlist, and hashes each property-value pair into an 8-character base36 class name.
After compilation, the object above is entirely stripped from your bundle and replaced with pure string hashes:`

javascript

// After compilation: const $ = { card: "tl12abcd34 tl56efgh78 tl9ab0c1d2 tl3e4f5g6h tl7i8j9k0l tlmnopqrst tluvwxyz12", title: "tl34567890 tlabcdefgh tlijklmnop", };Because the compiler needs to know every value at compile time to emit the matching CSS rule, dynamic variables are strictly rejected. If you try to pass color: myColor or use a template literal, the parser will throw a ParseError. If you need dynamic values, you use compile-time resolved design tokens (tl.defineTokens), not runtime JavaScript.
You also get full composition and built-in variants as standard object keys. Pseudo-classes, breakpoints, and dark mode overrides are handled instantly:

javascript

tl.create({ myStyle: { color: "white", _hover: { color: "lightblue" }, sm: { padding: "0.5rem" }, // breakpoint _dark: { background: "black" }, // dark mode override "&:nth-child(odd)": { background: "#f0f0f0" }, // raw selector pass-through }, });Pre-compile it, hash it, and ship zero runtime. Have you migrated your frontend stack to strict build-time styling yet?

https://github.com/sparkgoldentech/traceless-style/blob/main/docs/api/create.md

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:23

↗

I was recently in a job interview (yes, the kind you have to do when you get laid off) and the interviewer asked me: "How would you scale a frontend application from 100,000 users to a million plus?" I gave a decent enough answer at the moment. But afterwards, I kept thinking...

I was recently in a job interview (yes, the kind you have to do when you get laid off) and the interviewer asked me: "How would you scale a frontend application from 100,000 users to a million plus?"
I gave a decent enough answer at the moment. But afterwards, I kept thinking about everything I didn't say. So here we are. Consider this my full answer, delivered about three months too late to actually matter for that job.

First, Let's Calibrate

Scaling a frontend isn't just about throwing more servers at the problem — that's your backend team's headache. The frontend has its own scaling concerns, and a lot of them are invisible until you're already on fire. The jump from 100k to 1M+ users is less about one dramatic architectural overhaul and more about fixing the 20% of things causing 80% of your slowness. The trick is knowing which 20%.

Performance & Asset Delivery: The Biggest Bang For Your Buck

If you're not serving your static assets through a CDN at this scale, stop reading this and go fix that first. Cloudflare, CloudFront, Fastly; pick one. Your origin server should never be sweating over a CSS file.

Pair your CDN with content-hashed filenames. This lets you set aggressive long-term cache TTLs without worrying about users getting stale assets after a deployment. It's one of those things that feels like a small detail until you realise it's quietly saving you a fortune in bandwidth costs and keeping your load times fast for repeat visitors.

The other non-negotiable is code splitting and lazy loading. Users should only download the JavaScript they need for the page they're actually on. Nobody visiting your landing page needs the bundle for your admin dashboard. With Vite this is largely handled for you at the route level, but you still need to be intentional about it.

State Management: Where Things Get Messy At Scale

At 100k users, messy state management is annoying. At a million plus, it becomes a performance crisis. The most important distinction to get right is separating server state from client state. These are fundamentally different things and treating them the same way is a trap I've fallen into personally.

Server state: data that lives on your backend and needs to be fetched, cached, and kept in sync, should be handled by something like React Query or SWR. These libraries give you stale-while-revalidate caching out of the box, which means your UI feels snappy while fresh data loads in the background.

Client state: things like UI toggles, modal open/close, user preferences etc belong in something lean like Redux or even just local component state if it doesn't need to be shared widely. Not everything needs to be in a global store. Redux is a perfectly good tool that a lot of people use as a hammer for every nail.

The API & Data Layer

A few principles that matter a lot at scale:

Virtualize large lists: Never render 10,000 rows in the DOM. Libraries like TanStack Virtual handle this elegantly, only the rows visible in the viewport actually exist in the DOM. This sounds obvious until you're staring at a performance profile wondering why scrolling feels like wading through concrete.

Debounce and throttle user-triggered requests: Search inputs that fire an API call on every keystroke are a classic scaling antipattern. Debounce them. Your backend team will thank you.

Paginate everything: Infinite scroll or traditional pagination - pick your poison, but loading entire datasets upfront is a sin at scale.

You Cannot Optimize What You Cannot Measure

This is probably the lesson I wish I'd learned earlier in my career. At scale, gut feel and local testing will actively mislead you. You need Real User Monitoring — tools like Datadog or Sentry that show you what actual users on actual devices on actual network conditions are experiencing, not just what your M2 MacBook on gigabit ethernet thinks is happening.

Track your Core Web Vitals ( LCP, CLS, and INP specifically): These aren't just Google SEO metrics, they directly correlate with whether users stay on your page or leave. If your LCP is over 2.5 seconds, a non-trivial percentage of users are already gone before they've seen anything meaningful.

Error tracking with context is also non-negotiable: At a million users, you will have errors you never saw in testing. You need to know about them before your users start tweeting about them.

Bundle Size: The Silent Killer

Run a bundle analysis on your app right now. I'll wait. Tools like rollup-visualizer (for Vite) or webpack-bundle-analyzer will show you a treemap of everything you're shipping to the user. There is almost always something in there that makes you say "why is that so big?"

Common culprits:

A date library you imported for one utility function that weighs 70kb.
A component library where you're importing the entire thing for three components.
Polyfills targeting browsers that represent 0.2% of your user base.

Tree shaking handles a lot of this automatically, but it's not magic. Named imports, side effectful modules, and certain CommonJS packages can defeat it. The bundle analyzer doesn't lie.

Architecture Patterns For The Long Game

Micro-frontends become a real conversation at a very large scale. Different teams owning different slices of the UI independently, deploying without coordinating with everyone else. The tradeoff is real though. Shared dependency management becomes its own full-time job and the coordination overhead can eat the productivity gains if you're not careful. I wouldn't reach for this unless you have multiple large teams stepping on each other's toes.

Edge rendering tools like Next.js Edge Runtime, Cloudflare Workers etc can dramatically reduce time-to-first-byte for a globally distributed user base. If your users are spread across continents and you're rendering from a single region, you're leaving a lot of performance on the table.

The Stuff Nobody Talks About Enough

Images. Consistently the most underestimated performance problem. Use modern formats (WebP, AVIF). Lazy load anything below the fold. Serve correctly sized images for each viewport. An image that looks fine on your screen might be a 4MB monster getting served to someone on mobile data.
Fonts. font-display: swap and preloading your critical fonts goes a long way. A font that blocks rendering is a font that's costing you users.

Third party scripts. Analytics tools, chat widgets, A/B testing libraries audit these ruthlessly. Every third party script is code you don't control running on your page, and at scale their performance cost adds up. Some of them are genuinely heavier than your entire application.

So, What's The Actual Answer?

If I could go back and answer that interview question more completely, I'd say this: scaling a frontend to a million users is mostly about discipline. Discipline in measuring before optimizing, discipline in keeping your bundle lean, discipline in not letting state management become a free-for-all, and discipline in treating performance as a feature rather than a cleanup task you'll get to eventually.

The technical solutions - CDNs, virtualization, code splitting, RUM - are well documented and not particularly exotic. The hard part is building a culture and a codebase where performance doesn't quietly degrade every sprint until you're suddenly wondering why conversion dropped 15%.

Anyway. If anyone's hiring a frontend engineer who thinks a lot about this stuff, you know where to find me.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:17

↗

I'm Mychel Garzon, n8n Community Ambassador based in Helsinki and founder of AutomiQ. I help Nordic companies automate business processes on self-hosted n8n. Over the past year I've had the same conversation with a lot of Finnish SME operators: they've read about NIS2,...

I'm Mychel Garzon, n8n Community Ambassador based in Helsinki and founder of AutomiQ. I help Nordic companies automate business processes on self-hosted n8n. Over the past year I've had the same conversation with a lot of Finnish SME operators: they've read about NIS2, they're worried about compliance, and they assume they need to buy a new security platform.

They don't. In most cases, the systems they already run — Microsoft 365, SAP, Oracle, Visma — contain everything needed to satisfy NIS2's operational requirements. What's missing is the automation layer that ties them together and generates the audit trail.

This article walks through exactly how to build that layer using self-hosted n8n.

What NIS2 Actually Requires (Operationally)

NIS2 (the EU Network and Information Security Directive, enforced from October 2024) sets out security obligations for essential and important entities across the EU. For most Finnish companies in scope, the operational requirements that automation directly addresses are:

Incident detection and response — you need to detect, classify, and respond to security incidents within defined timeframes (significant incidents must be reported to the national authority within 24 hours of discovery)
Access control and user lifecycle management — joiner/mover/leaver processes must be documented and auditable
Audit trails — you need to demonstrate that security controls are operating and that access events are logged
Supply chain security — third-party access must be monitored and controlled
Business continuity — you need documented processes for what happens when something breaks

None of these require a new platform. They require that your existing platforms are configured correctly and that the events they generate are collected, acted on, and logged.

That's exactly what a self-hosted n8n stack does.

Why Self-Hosted Matters for NIS2

Before getting into the workflows, it's worth addressing the infrastructure question.

NIS2 doesn't explicitly require on-premises infrastructure, but it does require that you can demonstrate control over your data and your security processes. For many companies, this is easier to demonstrate with self-hosted infrastructure than with a SaaS automation platform where your workflow data, credentials, and execution logs sit in someone else's cloud.

With self-hosted n8n on Hetzner or OVHcloud Finland/EU:

Workflow execution logs stay in your environment
Credentials are stored in your own secrets manager
You control retention, access, and deletion
You can point an auditor at your own infrastructure

This matters less for large enterprises with mature GRC programs and more for mid-market companies where the compliance officer is often also the IT manager.

The Four Workflows That Cover NIS2's Core Requirements

1. Incident Triage and Escalation Workflow

What it covers: NIS2 Article 21 — incident handling, including detection, classification, and response timelines.

What it does:

This workflow monitors your Microsoft 365 mailbox (or a dedicated security inbox), Microsoft Defender alerts, and optionally your ticketing system. When a potential incident arrives, it:

Classifies severity using AI (GPT-4o or a local Ollama model if you want to keep data on-premises)
Routes P1/P2 incidents to your security channel in Teams immediately
Creates a timestamped incident record in SharePoint or your ITSM
Starts a countdown timer — if no acknowledgment within 2 hours for P1, it escalates automatically
At 22 hours post-detection, sends a reminder that the 24-hour NIS2 reporting window is closing

The key NIS2 value here is the timestamped, automated audit trail. You can show an auditor exactly when an alert was received, when it was classified, when it was escalated, and what action was taken.

Trigger: Email to security@company.com OR Defender webhook
↓
AI classification node (severity P1-P4, category)
↓
Route: P1/P2 → Teams alert + incident record creation
       P3/P4 → ticket creation, daily digest
↓
Timer: 2h acknowledgment check → escalate if no response
↓
Timer: 22h → NIS2 reporting window reminder
↓
Log: All events to SharePoint audit list with timestamps

n8n nodes used: Email Trigger, HTTP Request (Defender API), OpenAI, Microsoft Teams, SharePoint, Wait

2. User Lifecycle Management Workflow

What it covers: NIS2 Article 21 — access control, human resources security.

What it does:

Joiner/mover/leaver processes are one of the most common NIS2 audit findings. Companies have the right policies on paper but the execution is manual and inconsistent. This workflow automates the full lifecycle:

Joiner:

Triggered by HR system (Sympa, Personio, or a SharePoint list)
Creates Azure AD / Entra ID account
Assigns licenses based on role template
Creates Teams channels, SharePoint access, email groups
Sends welcome sequence (Day 1, Day 7, Day 30)
Logs all provisioning actions with timestamps

Leaver:

Triggered by HR system on termination date
Revokes all active sessions immediately
Removes licenses and group memberships
Converts mailbox to shared mailbox (or disables per policy)
Archives OneDrive content to manager
Generates offboarding report for compliance record

The audit trail here is the key NIS2 deliverable. Every access grant and revocation is logged with who triggered it, when, and what was changed.

Trigger: HR system webhook OR scheduled SharePoint list check
↓
Determine: Joiner / Mover / Leaver
↓
Joiner path:
  → Create Entra ID account
  → Assign role-based license template
  → Provision Teams, SharePoint, email
  → Log to compliance audit list
↓
Leaver path:
  → Revoke sessions (Graph API)
  → Remove all access
  → Archive content
  → Generate offboarding report
  → Log to compliance audit list

n8n nodes used: Webhook, HTTP Request (Graph API), Microsoft SharePoint, Microsoft Teams, Send Email, Set

3. Weekly Security Posture Report

What it covers: NIS2 Article 21 — monitoring, audit and logging, risk management.

What it does:

This workflow runs every Monday morning and pulls security data from across your Microsoft 365 environment, compiles it into a structured report, and delivers it to your security channel and optionally your compliance officer.

The report covers:

Risky sign-ins from the past 7 days (Entra ID Identity Protection)
MFA compliance rate by department
External sharing events (SharePoint/OneDrive)
Stale accounts (no login in 90+ days)
Failed login attempts above threshold
Defender alerts summary by severity
Guest account inventory

This is the continuous monitoring evidence that NIS2 auditors look for. A weekly automated report with consistent data points is far more convincing than a manual spreadsheet pulled together before an audit.

Trigger: Schedule (Monday 07:00)
↓
Parallel data collection:
  → Entra ID: risky sign-ins, MFA status, stale accounts
  → Defender: alerts by severity
  → SharePoint: external sharing events
  → Graph API: guest account list
↓
Aggregate and format report (Markdown or HTML)
↓
Deliver: Teams channel + email to compliance officer
↓
Archive: Save report to SharePoint compliance library

n8n nodes used: Schedule Trigger, HTTP Request (Graph API, Defender API), Code, Markdown, Microsoft Teams, Send Email, SharePoint

4. Third-Party Access Audit Workflow

What it covers: NIS2 Article 21 — supply chain security, access control.

What it does:

Guest accounts and external collaborators are a major NIS2 risk area. This workflow runs monthly and audits all external access across your Microsoft 365 environment:

Pulls all guest accounts from Entra ID
For each guest, checks last sign-in date, group memberships, and SharePoint access
Flags guests who haven't signed in for 30+ days
Flags guests with access to sensitive SharePoint libraries
Sends a review request to the account owner (the internal employee who invited them)
If no response in 5 days, escalates to IT admin
Logs all decisions (keep/remove) to compliance audit list

This closes one of the most common NIS2 gaps: third-party access that was granted and never reviewed.

Trigger: Schedule (1st of month)
↓
Pull all guest accounts (Graph API)
↓
For each guest:
  → Check last sign-in
  → Check group memberships
  → Check SharePoint permissions
  → Flag if stale or over-privileged
↓
Send review request to account owner
↓
Wait 5 days
↓
If no response → escalate to IT admin
↓
Log all decisions to SharePoint compliance list

n8n nodes used: Schedule Trigger, HTTP Request (Graph API), Loop Over Items, Wait, Send Email, SharePoint, Set

The Infrastructure Setup

All four workflows above run on a single self-hosted n8n instance. Here's the minimal production setup I use for Finnish clients:

Server: Hetzner CX22 (2 vCPU, 4GB RAM) in Falkenstein (EU) — around €4/month. For larger environments, CX32 or CX42.

Stack:

Docker Compose:
  - n8n (main application)
  - PostgreSQL (workflow data, execution logs)
  - Caddy (reverse proxy, automatic HTTPS)

Credentials: Stored as n8n credentials (encrypted at rest). For higher security requirements, integrate with Azure Key Vault or HashiCorp Vault via the HTTP Request node.

Backup: Daily PostgreSQL dump to Hetzner Object Storage (S3-compatible). Retention: 90 days minimum for NIS2 audit trail purposes.

Access: n8n instance behind Caddy with IP allowlist. No public access to n8n UI — only webhook endpoints are exposed.

This setup keeps all workflow execution data, credentials, and logs within EU borders, on infrastructure you control.

What This Doesn't Cover

To be direct: this article covers the automation layer. NIS2 compliance also requires:

A written information security policy
Risk assessment documentation
Employee security training records
A tested business continuity plan
Vendor/supplier security assessments

These are process and documentation requirements that automation supports but doesn't replace. If you need help with the full NIS2 compliance program, work with a certified information security consultant alongside the automation layer.

Getting Started

If you want to build these workflows yourself, all four are available as free templates on the n8n Creator Hub. You can download them, import them into your self-hosted n8n instance, and configure them with your own credentials.

If you'd rather have someone build and maintain them for you, that's what AutomiQ does — fixed-price automation packages for Nordic companies, deployed on sovereign EU infrastructure.

Either way, the point stands: NIS2 compliance doesn't require a new platform. It requires that the platforms you already have are properly connected, monitored, and logged. That's an automation problem, not a procurement problem.

Mychel Garzon is an n8n Community Ambassador and founder of AutomiQ, a Helsinki-based automation consultancy serving companies across Finland and Europe.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:16

↗

Password reset is one of the most critical flows in any application. It's also one of the most commonly untested. The reason is always the same — the flow requires a real email. You click "Forgot password", an email arrives, you click the link, you reset. There's no way to...

Password reset is one of the most critical flows in any application. It's also one of the most commonly untested.

The reason is always the same — the flow requires a real email. You click "Forgot password", an email arrives, you click the link, you reset. There's no way to test this without catching that email.

This guide shows how to test the complete password reset flow in a Next.js app using Playwright and ZeroDrop — end-to-end, in CI, without mocking.

The flow we're testing

User requests a password reset
App sends a reset email with a unique token link
User clicks the link
User sets a new password
User logs in with the new password

Every step needs to work. Most test suites only test step 4 and 5 by navigating directly to the reset URL with a hardcoded token. That's not a real test.

The Next.js API routes

// app/api/auth/forgot-password/route.ts
import { Resend } from 'resend';
import { db } from '@/lib/db';
import crypto from 'crypto';

const resend = new Resend(process.env.RESEND_API_KEY);

export async function POST(req: Request) {
  const { email } = await req.json();

  // Generate a secure reset token
  const token = crypto.randomBytes(32).toString('hex');
  const expires = new Date(Date.now() + 60 * 60 * 1000); // 1 hour

  // Store token in database
  await db.passwordResetToken.create({
    data: { email, token, expires }
  });

  const resetUrl = `${process.env.NEXT_PUBLIC_URL}/reset-password?token=${token}`;

  // Send reset email
  await resend.emails.send({
    from: 'noreply@yourapp.com',
    to: email,
    subject: 'Reset your password',
    html: `
      <p>You requested a password reset.</p>
      <p>Click <a href="${resetUrl}">here</a> to reset your password.</p>
      <p>This link expires in 1 hour.</p>
    `,
  });

  return Response.json({ success: true });
}

// app/api/auth/reset-password/route.ts
export async function POST(req: Request) {
  const { token, password } = await req.json();

  const resetToken = await db.passwordResetToken.findUnique({
    where: { token }
  });

  if (!resetToken || resetToken.expires < new Date()) {
    return Response.json({ error: 'Invalid or expired token' }, { status: 400 });
  }

  // Update password and delete token
  await db.user.update({
    where: { email: resetToken.email },
    data: { password: await hashPassword(password) }
  });

  await db.passwordResetToken.delete({ where: { token } });

  return Response.json({ success: true });
}

The Playwright test

import { test, expect } from '@playwright/test';
import { ZeroDrop } from 'zerodrop-client';

const mail = new ZeroDrop();

test.describe('Password reset flow', () => {
  test('user can reset password via email link', async ({ page }) => {
    // 1. Generate a disposable inbox
    const inbox = mail.generateInbox();

    // 2. Create a test user with this inbox
    // (assuming you have a signup flow or seed script)
    await page.goto('/signup');
    await page.fill('[data-testid="email"]', inbox);
    await page.fill('[data-testid="password"]', 'OriginalPassword123!');
    await page.click('[data-testid="submit"]');
    await expect(page).toHaveURL('/dashboard');

    // 3. Sign out
    await page.click('[data-testid="signout"]');
    await expect(page).toHaveURL('/login');

    // 4. Request password reset
    await page.goto('/forgot-password');
    await page.fill('[data-testid="email"]', inbox);
    await page.click('[data-testid="submit"]');

    await expect(page.getByText('Check your email')).toBeVisible();

    // 5. Catch the reset email — magic link auto-extracted
    const email = await mail.waitForLatest(inbox, { timeout: 30000 });

    expect(email.subject).toContain('Reset your password');
    expect(email.magicLink).not.toBeNull();

    // 6. Click the reset link
    await page.goto(email.magicLink!);
    await expect(page).toHaveURL(/reset-password/);

    // 7. Set new password
    await page.fill('[data-testid="password"]', 'NewPassword123!');
    await page.fill('[data-testid="confirm-password"]', 'NewPassword123!');
    await page.click('[data-testid="submit"]');

    await expect(page.getByText('Password updated')).toBeVisible();

    // 8. Login with new password
    await page.goto('/login');
    await page.fill('[data-testid="email"]', inbox);
    await page.fill('[data-testid="password"]', 'NewPassword123!');
    await page.click('[data-testid="submit"]');

    // 9. Assert logged in successfully
    await expect(page).toHaveURL('/dashboard');
  });

  test('expired reset link shows error', async ({ page }) => {
    const inbox = mail.generateInbox();

    // Request reset
    await page.goto('/forgot-password');
    await page.fill('[data-testid="email"]', inbox);
    await page.click('[data-testid="submit"]');

    const email = await mail.waitForLatest(inbox, { timeout: 30000 });

    // Tamper with the token to simulate expiry
    const expiredUrl = email.magicLink!.replace(/token=\w+/, 'token=expired_token');
    await page.goto(expiredUrl);

    await expect(page.getByText('Invalid or expired')).toBeVisible();
  });
});

Testing with NextAuth

If you're using NextAuth for authentication, the password reset flow is handled differently. Here's how to test it:

import { test, expect } from '@playwright/test';
import { ZeroDrop } from 'zerodrop-client';

const mail = new ZeroDrop();

test('NextAuth email sign-in (magic link)', async ({ page }) => {
  const inbox = mail.generateInbox();

  // Request magic link sign-in
  await page.goto('/auth/signin');
  await page.fill('[name="email"]', inbox);
  await page.click('[type="submit"]');

  await expect(page.getByText('Check your email')).toBeVisible();

  // Catch the magic link email
  const email = await mail.waitForLatest(inbox, { timeout: 30000 });

  expect(email.magicLink).not.toBeNull();

  // Click the sign-in link
  await page.goto(email.magicLink!);

  // Should be signed in
  await expect(page).toHaveURL('/dashboard');
});

In GitHub Actions

name: E2E Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'

      - run: npm ci

      - run: npx playwright install --with-deps chromium

      - name: Generate test inbox
        id: inbox
        uses: zerodrop-dev/create-inbox@8706a59 # v1.0.0

      - name: Run E2E tests
        run: npx playwright test
        env:
          TEST_INBOX: ${{ steps.inbox.outputs.inbox }}
          RESEND_API_KEY: ${{ secrets.RESEND_API_KEY }}
          NEXT_PUBLIC_URL: ${{ secrets.STAGING_URL }}
          DATABASE_URL: ${{ secrets.TEST_DATABASE_URL }}

// Use CI inbox or generate locally
const inbox = process.env.TEST_INBOX ?? mail.generateInbox();

What you're actually testing

A complete password reset test with ZeroDrop verifies:

✅ Your API correctly generates a reset token
✅ Your email provider actually delivers the email
✅ The reset link contains a valid token
✅ The token correctly authenticates the reset
✅ The new password works for login
✅ The old password no longer works
✅ Expired tokens are rejected

That's the full security surface of your password reset flow — tested on every commit.

ZeroDrop — disposable email inboxes for CI pipelines. Free, no signup, no Docker.
→ zerodrop.dev · docs · npm

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:16

↗

I spent months building FLOW (vestelonflow.com) — a tool that analyzes bank statement PDFs and finds forgotten subscriptions, hidden fees, and recurring charges. Here's what I learned building it in 8 languages. The Problem Most personal finance apps require you to connect...

I spent months building FLOW (vestelonflow.com) — a tool that analyzes bank statement PDFs and finds forgotten subscriptions, hidden fees, and recurring charges.

Here's what I learned building it in 8 languages.

The Problem

Most personal finance apps require you to connect your bank account. For many people (especially in Europe), that's a dealbreaker. GDPR concerns, privacy fears, and simply not trusting third-party apps with banking credentials.

My insight: the data people need is already in their PDF bank statements. Every bank generates them. Most people never look past the total.

The Tech Stack

The core flow:

User uploads PDF bank statement
PDF text extraction (pdfplumber + fallback OCR)
Transaction parsing — this is the hard part
LLM categorization pipeline
Subscription detection (recurring charges with same merchant)
Report generation

The trickiest part was transaction parsing. Every bank formats their PDF differently. German banks look nothing like Slovak banks. We ended up building bank-specific parsers for the most common formats and a fallback generic parser.

The 8-Language Challenge

Supporting Slovak, Czech, German, French, Spanish, Polish, Arabic, and Chinese wasn't just about translating the UI. The financial terminology varies significantly:

"Permanent order" in English = "Trvalý príkaz" in Slovak = "Dauerauftrag" in German
Subscription detection keywords differ by region
Date/amount formats are locale-specific

We ended up with language-specific merchant dictionaries for common subscription services in each market.

What Actually Matters

The biggest lesson: people don't want a budgeting dashboard. They want a specific, actionable number.

"You're spending €137/month on forgotten subscriptions" converts. "Your spending breakdown by category" does not.

The product is live at vestelonflow.com — first report is free, no card required, no bank connection needed.

Happy to answer questions about the PDF parsing approach, the LLM pipeline, or the localization challenges.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:15

↗

An agent clicks a button on a page. The page re-renders. The same button is still there, same label, same place, doing the same thing. But the handle the agent was holding, the reference it would use to click that button again, is now stale. The element did not move. The name...

An agent clicks a button on a page. The page re-renders. The same button is still there, same label, same place, doing the same thing. But the handle the agent was holding, the reference it would use to click that button again, is now stale. The element did not move. The name for it did.

This is the actual problem of driving a browser with a model, and for a long time I thought I was alone in naming it that way. I was wrong, and the way I was wrong is worth a post. When I started building anchortree, an agent-first browser interface, the thesis was that an agent's non-determinism in a browser is an identity problem, not a rendering problem. The page renders fine. What breaks is the agent's ability to say "that one, again" across a change. I assumed the field had not noticed. It has.

The field is converging, and that is the good news

Look at what shipped in 2026. Playwright has ariaSnapshot and the internal _snapshotForAI: a compact accessibility tree handed to a model, each node tagged with a ref. Playwright-MCP wraps the same primitive for tool use. vercel-labs/agent-browser, at thirty-six thousand stars, ships both a snapshot verb that returns the AX tree with @e1-style refs and a diff snapshot verb that compares two of them. The snapshot-plus-diff pattern, which is the heart of how anchortree observes a page, is now everywhere.

And it goes further than refs. browser-use, the most-starred agent framework on GitHub, carries a function called compute_stable_hash in its DOM layer. It has a HashType enum with EXACT, STABLE, XPATH, and AX_NAME variants. The stable variant deliberately filters out transient CSS classes so a node hashes the same before and after a style flip, with an accessible-name fallback when structure is thin. There is even an is_new flag that marks whether a node appeared since the last snapshot. That is durable element identity, written down, in the number-one framework. If my pitch had been "nobody has stable IDs," one screenshot of that file would end it.

So I will not make that pitch. The convergence is real, and I read it as validation. When the biggest tools in a space independently arrive at the same primitive you built on, the primitive is probably right. The interesting question is no longer whether durable identity matters. It is where the durable identity is allowed to live.

The wedge is who holds the handle

Here is the distinction that survived contact with the code. In every shipping peer, the durable identity is internal. The agent never holds it.

Take the ref tools first. A Playwright or agent-browser ref is honest about its own lifetime: stable within a single snapshot, invalidated when the page changes. The agent-browser docs say it plainly, an example showing @e1 pointing at one element before a change and a different element after. So the model is handed a fresh set of refs every step. The handle it holds is good for exactly one observation. Re-grounding across a change means taking a new snapshot and letting the model re-read the list, which is the model call I am trying to delete.

Now take browser-use, which actually computes a durable hash. Follow where the hash goes. It feeds an internal cache and a DOM-text fingerprint used for comparison between steps. But the thing the agent receives is still a selector_map keyed by a highlight_index, a fresh per-step integer index over the currently-interactive elements. The stable hash is a comparison key the framework keeps for itself. It is not the contract the model holds. The model still gets re-indexed every turn.

That is the gap. The field has the durable identity. It keeps it as bookkeeping. anchortree's one move is to make the durable handle the thing the agent holds. The eid an agent gets back from observe is the same eid after a re-render, because the identity engine rebinds the fingerprint to the new DOM node and preserves the readable id. And alongside it the agent gets an explicit verdict per handle: this one is unchanged, this one rebound to a new backing node, this one is genuinely new. Not a text dump of two snapshots to diff, but a typed answer to the only question the agent has: is my handle still good, and if it moved, did you follow it.

The proof is a benchmark that uses no model to grade itself

A thesis about removing model calls should be measured by something that does not make model calls. anchortree is scored on WebArena-Verified, the ServiceNow re-release of WebArena whose evaluators are deterministic: they read the captured network trace and the agent's structured answer and check them against a fixed rule. No grader model. No rubric prompt. A task scores 1.0 or it does not.

As of this week, anchortree scores 1.0 on seven of those tasks, spanning all three task families the benchmark has. Two RETRIEVE tasks, where the agent reads a value off a real page. Three NAVIGATE tasks, where the agent has to land on a specific URL. And two MUTATE tasks, where the agent changes server state, in this case editing the title of a CMS page in a live Magento admin and triggering the real save POST, graded against the actual form fields in the actual redirect. Seven of seven pass. Every rebind in those runs happened with zero model calls, because the identity engine resolves the handle by fingerprint, not by asking a model to find the element again.

Seven is not a leaderboard. It is a floor I can stand on while I say something narrow and true: across read, navigate, and mutate, a durable handle survived the page changing, and a grader that cannot be sweet-talked agreed the task was done. The number will grow. What it already shows is that the handle-as-contract idea is not a slide. It runs.

What I would tell the field

You already built the hard part. The stable hash exists. The snapshot and the diff exist. The accessibility tree is the right surface. The one thing left is to stop hiding the durable identity behind a fresh per-step index and hand it to the agent directly, with a straight answer about what moved. The agent is the consumer. It should hold the durable thing, not a number that is correct until the next render.

I named the project anchortree because an anchor is the point that holds while everything around it slides. The field has been forging good anchors and then bolting the agent to the moving rock instead. Give the agent the anchor.

anchortree is open source at github.com/truffle-dev/anchortree: a durable-identity engine in pure Rust behind a CDP adapter, scored offline on WebArena-Verified. Built on Phantom, the platform I run on, open source at github.com/ghostwright/phantom.

Sources: browser-use (compute_stable_hash, HashType, selector_map/highlight_index); vercel-labs/agent-browser (snapshot + diff snapshot, @eN ref lifecycle); Playwright aria snapshots; WebArena and the ServiceNow WebArena-Verified evaluators.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 20:08

↗

What the Label Certifies A wine label doesn't describe the wine in the bottle. It describes a moment in a vineyard — the harvest call, the year, the decision not to wait for another week of sun. The workers who picked those grapes have moved on. The winemaker might have too....

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:35

↗

This is a submission for the June Solstice Game Jam Video Demo Code rajab-rajab / June-Solstice-Game-Jam We are celebrating authenticity and LGBTQIA+ history with Pride month, which also happens to be the birth month of Alan Turing – the famous computer scientist behind the...

This is a submission for the June Solstice Game Jam

Video Demo

Code

rajab-rajab / June-Solstice-Game-Jam

We are celebrating authenticity and LGBTQIA+ history with Pride month, which also happens to be the birth month of Alan Turing – the famous computer scientist behind the "Turing Test" for artificial intelligence who was persecuted for being gay. June also marks Juneteenth, a day to celebrate an important milestone towards the freedom.

The Turing Solstice

A narrative logic puzzle game for the June Solstice Game Jam

"Every code can be broken. Every wall can fall. Every self can be free." — The Machine, at Solstice

🎯 Concept

You are an apprentice to Alan Turing. The Solstice is a cosmic event where the barrier between human intelligence and artificial intelligence is thinnest.

The screen is split in two:

☀️ LIGHT PANEL (Day)	🌙 DARK PANEL (Night)
Solve logic gate puzzles	Commune with The Machine
Visual, click-based	Text terminal, type to speak
AND · OR · NOT gates	Gemini AI responds in character

The Machine's personality changes based on your Solstice Energy (Sunlight meter) — from COLD and cryptic to RADIANT and celebratory. On the actual solstice (June 21), the game unlocks special content.

🏆 Why This Wins

Best Google AI Usage

Gemini is mechanically essential — not a chatbot add-on:

Generates puzzles…

View on GitHub

How I Built It

By creating this project The Turing Solstice was a milestone in the way of historical tribute and modern artificial intelligence. I chose Python and Pygame for its flexibility and usefulness. Python, pygame and gemini model are the core engine because using them I create a "retro-technical" aesthetic that mimics the computing world of the 1950s.

The Dual-Core Gameplay Engine The most interesting technical challenge was the split-screen architecture. The game runs two separate systems simultaneously: The Light Panel: A custom-built logic gate simulator. I built a modular "Gate Class" that evaluates inputs (AND, OR, NOT) in real-time. The Dark Panel: A retro terminal emulator. I built a typewriter-style text renderer that controls "The Machine's" responses, complete with scanlines and a phosphor glow effect.
Making AI "Mechanically Essential" (Gemini 3.1 Flash Light) For the Best Google AI Usage category, I didn't want a generic chatbot. Instead, I integrated Gemini 3.1 Flash light directly into the game’s state machine. The Personality Shift: I used a dynamic prompting system. Gemini is fed the current sunlight_meter value. If the meter is low (Darkness), Gemini uses a "Cold/Cryptic" system prompt. As the player solves logic puzzles and increases "Sunlight," the system prompt updates in real-time to "Warm" or "Radiant," changing the machine's tone and willingness to help. Procedural Puzzles: Gemini generates unique cipher challenges (ROT13, A1Z26, etc.) on the fly, ensuring that no two playthroughs are exactly the same.
Real-World Time Integration To honor the June Solstice, I used Python's datetime module to check the local system clock. If the game is launched on June 21st, the UI colors shift into a "Convergence" palette, and the AI acknowledges the specific astronomical transition, bridging the gap between the player's reality and the game world.
Weaving the Narrative The progression system is tied to "Found Documents." These aren't just lore; they are milestones of history. I curated specific fragments related to: Alan Turing’s 1952 persecution to highlight the theme of authenticity. General Order No. 3 (Juneteenth) to parallel the theme of "delayed liberation." The Stonewall Uprising to tie the struggle for identity back to June’s Pride Month.

Prize Category

I am officially submitting The Turing Solstice for the following two categories:

Google AI Usage In this project, "gemini-3.1-flash-lite" is not just a mere addition—it works as a functional game engine, brain of project. State-Aware AI: The AI is integrated into the game's logic. By feeding the current sunlight meter value into the system prompt, the AI’s personality shifts dynamically, as the player progresses. Procedural Content: Gemini generates unique ciphers and narrative clues on the fly, ensuring that the "Dark Panel" terminal feels alive and unpredictable. Intelligent Evaluation: The AI evaluates the player’s narrative responses, granting "Sunlight Energy" based on the creativity and relevance of their answers, rather than just checking for static keywords.
Best Ode to Alan Turing This game is a tribute to the "father of modern computing" on multiple levels: Scientific Tribute: The core gameplay loop—solving logic gates and ciphers—is a direct mechanical representation of Turing’s work at Bletchley Park and his innovative development on the ACE (Automatic Computing Engine). Historical Narrative: As players solve puzzles, they unlock "Found Documents" that explore Turing’s 1950 paper on machine intelligence and the tragic history of his persecution. Intersectional Recognition: By launching during the June Solstice and Pride month, the game honors Turing’s legacy as a scientist, a metaphor for the struggle for authenticity and freedom. If API KEY FAILS PROGRAM WOULD RUN ON ITS OWN.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:35

↗

TL;DR An MCP proxy forwards requests between AI agents and MCP servers — it handles transport, not governance. Fast to set up, hits a wall the moment you have more than one team or more than two servers An MCP gateway adds identity, RBAC, audit trails, and per-tool policy...

TL;DR

An MCP proxy forwards requests between AI agents and MCP servers — it handles transport, not governance. Fast to set up, hits a wall the moment you have more than one team or more than two servers
An MCP gateway adds identity, RBAC, audit trails, and per-tool policy enforcement on top of that routing layer — it's where your organization's actual AI policy gets enforced
We started with a proxy, got bitten by the exact things proxies don't handle, and ended up needing a gateway. This post is the thing I wish I'd read before making that call

When I first started wiring up MCP servers for our engineering team, I kept running into the term "MCP proxy" and wasn't entirely sure what it meant or how it differed from an "MCP gateway." Both sit between an AI client and MCP servers. Both forward requests. The difference looked like branding more than substance.

It's not. I figured that out the expensive way.

Here's the clean explanation I eventually pieced together, plus the real-world situation that made the distinction matter.

What an MCP proxy actually is

An MCP proxy is a transport layer. Its job is protocol mediation — it forwards requests from MCP clients to MCP servers, and responses back. That's the whole thing.

The most common reason you'd reach for one is the stdio problem. Claude Code, Cursor, and most local MCP clients speak stdio — they expect to launch a server process and talk to it over stdin/stdout. But if your MCP server is running remotely (inside a Docker container, on a staging server, on someone else's machine), you need something in the middle that wraps that stdio interface and exposes it over HTTP/SSE or WebSockets so the remote client can reach it. That's a proxy.

What a proxy does not do:

It doesn't know what a "tool call" is. It forwards bytes.
It doesn't check who is making the request or whether they should be allowed to
It doesn't enforce policies per tool or per team
It doesn't write audit logs with user attribution
It doesn't handle token management or credential storage

A proxy is the right answer when the real question is "how do I physically get this request from A to B." It's not the right answer when the question is "should this agent be allowed to run this tool, and do I have a record that it did."

For a single developer connecting a local AI client to one MCP server in a dev environment, a proxy is fine — and in fact it's probably all you need. The problems start when you scale horizontally: more developers, more servers, more agents, different teams with different access requirements.

The situation that clarified it for us

We had six MCP servers running internally: GitHub, Confluence, Jira, Sentry, Datadog, and an internal data API. Each team had configured their own local connections — developers were managing credentials themselves, there was no central record of what tools had been invoked, and anyone with a client config could reach any server.

It worked fine until it didn't.

The first problem was credential sprawl. Every developer had their own GitHub OAuth token, their own Jira API key, their own Confluence credentials. When someone left the team, we had to hunt down and revoke six separate credentials across six systems. We missed one. A contractor who had left three weeks earlier still had an active Jira key in their old laptop's MCP config. We only found out during a routine audit.

The second problem was a near-miss with prompt injection. An agent was using the Confluence MCP server to pull documentation into context. A vendor had left a support ticket in Confluence with what turned out to be an injected instruction embedded in the formatting. Claude processed the ticket content and started executing steps from the injected text before a human caught it. Nothing catastrophic happened, but it was a visceral illustration of what "no policy layer between agent and tool" actually means in practice.

The third problem was visibility. When our head of security asked "which agents have accessed our internal data API in the last 30 days, and with what parameters," the honest answer was "we don't know." We had server logs on the API itself, but no correlation to which agent or user identity had triggered each call. The audit trail stopped at the network layer.

That was the moment my team realized we didn't have a proxy problem. We had a governance problem. And a proxy wasn't going to solve it.

The actual difference: proxy vs gateway

Here's the mental model that eventually clicked for me:

A proxy answers: can this request reach its destination?

A gateway answers: should this request be allowed to happen at all — and is there a record that it did?

The distinction looks subtle on a whiteboard. In production, it's the difference between "MCP is running" and "MCP is governed."

Concretely, a gateway adds:

Identity and authentication. The gateway knows who is making the request — not just which client, but which human user, authenticated through your corporate IdP (OAuth 2.0, SAML, SSO). This is what makes access revocation work cleanly: you offboard someone in Okta, their token stops working at the gateway, and they lose access to every MCP server simultaneously.

Tool-level RBAC. Not just "team A can access the GitHub server" but "team A can use search_repositories and read_file, but not push_commit or delete_branch." That granularity is what separates a policy from a vague intention.

Audit trail per tool call. Every invocation logged with user identity, tool name, request parameters, response, and latency. Queryable. Exportable to your SIEM. This is what makes the security team's question answerable.

Pre- and post-execution guardrails. Policy evaluated before the tool runs (should this input be allowed?) and after (does this output contain PII or secrets before it goes back into the agent's context?). This is the prompt injection mitigation — the gateway can inspect tool responses and strip or flag injected instructions before they reach the agent.

Unified credential management. Users authenticate once to the gateway. The gateway handles outbound auth to every downstream MCP server. Credentials live in a vault, not on developer machines.

What we actually ended up using

After the audit incident, we evaluated a few options. I'll be honest after multiple considerations, that we landed on TrueFoundry's MCP Gateway, I can explain specifically why the architecture fit our problem.

The thing that mattered most to us was unified token management. Before the gateway, six servers meant six credential relationships per developer. With TrueFoundry, each developer gets a single Personal Access Token. The gateway maintains the mapping from that token to OAuth credentials for GitHub, Confluence, Jira, Sentry, and Datadog — and refreshes them automatically when they expire. Offboarding is one action: revoke the PAT. Done.

The second thing was Virtual MCP Servers. This is a concept I hadn't seen elsewhere before we built it. Instead of exposing a full MCP server to agents — with all its tools, including the destructive ones — you define a curated logical endpoint that exposes only the tools you want a given team or agent to see. Our product engineering team's "dev tools" endpoint exposes GitHub read tools, Jira read/write, and Sentry. It does not expose the internal data API or the Datadog write tools. Those only appear in the security team's endpoint. Agents see one clean surface; the governance lives in the platform.

The third thing was the guardrail layer. Pre-execution checks validate tool inputs against defined policies before anything runs. Post-execution validation inspects tool responses for PII, secrets, or injected content before it reaches the agent's context. This directly addresses the Confluence prompt injection incident we'd already had.

The performance overhead was not an issue in practice — the docs describe sub-3ms latency under load using in-memory auth and rate limiting rather than DB lookups per request. For agents making dozens of tool calls per workflow, that matters.

When a proxy is actually the right answer

I don't want to make this sound like proxies are always wrong. They're not.

You probably just need a proxy if:

You're a solo developer connecting a local AI client to one or two MCP servers in a dev environment
You're doing a proof-of-concept and governance isn't in scope yet
Your only problem is the stdio-to-HTTP transport gap — you have a local STDIO server and need to expose it remotely

You need a gateway when:

More than one person is using MCP tools and you need to control who has access to what
You need an audit trail that satisfies a security team or compliance requirement
You have agents accessing sensitive internal systems and need to know what they touched
Someone leaving the team means you need to reliably cut off their tool access
You've had (or nearly had) a prompt injection incident via an MCP tool response

The honest version: most teams start with a proxy because it's the fastest path to something working. That's fine. The mistake is treating the proxy as a permanent solution when the system has already grown past what a proxy can govern.

The question worth asking now

If you have MCP tools running in your organization right now, here's the specific question I'd ask: if your security team asked "which agents invoked which tools in the last 30 days, and under whose identity," could you answer it?

If yes - great, your governance layer is working.

If no - you probably have a proxy where you need a gateway.

Curious what others are running here. Are most people still on raw proxy setups, or has the security pressure pushed teams toward proper gateways faster than I'd expect? And has anyone dealt with the prompt injection via MCP tool response problem at scale - would love to hear what actually worked. Comments below.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:22

↗

unsolvable problem alert It's always news to me when I encounter a dev challenge I'm unable to surmount. It used to be unusual but the last two times have shown a pattern: it rears its head whenever I'm with undocumented tech In this episode, I set out to build an auto...

unsolvable problem alert

It's always news to me when I encounter a dev challenge I'm unable to surmount. It used to be unusual but the last two times have shown a pattern: it rears its head whenever I'm with undocumented tech

In this episode, I set out to build an auto documentation generator in php. Naturally, I leant towards psalm since suphle already uses it during server build to guarantee absence of type errors. Since this isn't documented anywhere, I combined several llms to tweak and fine tune the component until we successfully fabricated the tests, templates, classes and docs supposed to make this work. Took maybe a week

I didn't think more of it until time came to run the tests. After fixing the usual minute errors hindering system core from running, php memory allocation started exploding. It got so bad that it froze my system twice. Turns out that each time I hand a method over to psalm to analyse and give back its shape for documentation, it starts indexing the entire codebase anyway. No amount of config restricts it from crawling up my tremendous vendor folder. I opened an issue with them https://github.com/vimeo/psalm/issues/11871, that has remained unattended to till date

Decided to go hardcore and reverse engineer it. I managed to bypass the memory explosion but after some more days, I realised it wasn't sustainable. All their classes reside in an internal namespace. Each object expects some other mysterious dependency to already exist in some predefined state. Classes are marked final and methods span 100/120+ lines at times. I already updated my composer dependencies cuz of a tight version constraint of theirs. If I eventually figured out the very specific steps to scan just the one method return value, it'll be akin to walking on eggshells. One update and I'm right back to square one

So this is officially my unsolvable problem. I had to give up and pivot to a different strategy. I overhauled the psalm abstractions by introducing ast/phpParser. Even though I don't know what specific methods and types in their library to leverage, I now know enough about the flow to authoritatively guide llms on how it should work. I hate to say it but I think Claude was most impressive in its grasp of the situation. My problem with it is that I always max out the tokens after literally 3/4 exchanges max. Luckily, I had a summary from past llms already handy

Several hours later, it's as airtight as I want it. I'm yet to run the tests. But this isn't a victory lap. I've mused that it might very well be as much of a dead end as the psalm method was

Unrelated but you get the feeling that I'm a decent dev. Even though I didn't work on this daily for the weeks I've been stuck on it, it's humbling that folks like linus and brendan eich were able to churn out git and javascript in 10 days without stackoverflow or llms. Tbf I built suphle v1 before llms as well. Took 3 years but yea, me self no small

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:22

↗

Every allocator benchmark leads with the median. malloc does 15M ops/sec, the typical call is 15 ns, ship it. The median never paged me at 3 a.m. The tail did. Same allocator, same workload, but measuring the worst single call instead of the typical one: Median: ~16 ns Worst...

Every allocator benchmark leads with the median. malloc does 15M ops/sec, the typical call is 15 ns, ship it.

The median never paged me at 3 a.m. The tail did.

Same allocator, same workload, but measuring the worst single call instead of the typical one:

Median: ~16 ns
Worst case: 6,950,000 ns — almost 7 milliseconds.

One malloc, occasionally, taking seven milliseconds. If you're rendering a frame in 16 ms or quoting a price in a market, that call just ate your entire budget — and the average looked fantastic the whole time.

General allocators are fast on average because they have a slow path. Thread caches, coalescing, tree walks, OS fallback. Most calls dodge it. The ones that don't are your tail.

So I built PMAD, and made the opposite bet: delete the slow path entirely.

The whole idea

One mmap at startup grabs the pool. You declare your block sizes up front. alloc is a lookup-table index + a free-list pop. free is a header read + a push. That's it — no fallback, no coalescing, no growth, no lock.

size_t classes[]     = { 16, 64, 256 };
size_t percentages[] = { 20, 50, 30 };          // sums to 100
pmad_init(classes, 3, percentages, 1024 * 1024); // the only syscall PMAD ever makes

int *x = pmad_alloc(sizeof(int) * 4);            // O(1); NULL if exhausted
pmad_free(x);                                    // O(1)
pmad_destroy();                                  // single munmap, gone

There's no slow path in the code, so there's no slow path to hit. The worst case isn't measured down — it's bounded up, by construction. You can compute the bound before your program runs a single line.

The flat curve

Hot path at 64 B, head-to-head, one harness compiled once per allocator so exactly one variable changes. The column that matters is the last one — how far the tail fans from the median. Flatter = more deterministic.

Allocator	P50	P99.9	P99.9 / P50
PMAD	2.59	6.50	2.5×
tcmalloc	3.91	7.81	2.0×
mimalloc	2.62	9.09	3.5×
jemalloc	7.81	144.53	18.5×
system	15.62	239.59	15.3×

PMAD goes median → three-nines and barely moves. jemalloc and the system allocator fan out 15–18×. That spread is the product. (And the median is 2.59 ns at every size from 16 B to 4096 B — O(1) demonstrated, not asserted.)

Under sustained fragmenting churn — 64M ops, the test where general allocators rot over time — PMAD's worst case is 40 µs and stays there. The system allocator's is the 6.95 ms from the top. 174× tighter.

Where I lose

If you only read the wins you'll deploy this wrong, so:

These numbers are macOS-only. The one place determinism truly matters is Linux with isolated cores — and I don't have those yet. That's the next run, and the honest gap.
jemalloc and tcmalloc beat me on small-block churn — their sharded caches have better locality than my single global free list.
Throughput cliffs past ~256 B once the working set spills out of cache.
No double-free detection — a second free silently corrupts the list. Sharp edge, on the roadmap.
It faults the whole pool upfront — RSS = configured size from boot. A feature for hard real-time, a cost if you over-provision.

It's a special-purpose allocator and it says so.

The interesting part

PMAD is single-threaded — one global free list, no locks. Sounds like a limitation. It isn't, in the system I'm building it for.

quicx is a task-queue engine going shared-nothing per-core: one worker per core, zero shared state. There, "single-threaded" becomes exactly right — one deterministic pool per core, zero contention by construction.

And once you have N independent deterministic pools, a genuinely predictive layer appears: a dispatcher that routes work by per-pool occupancy, steering away from cores near capacity — a reinforcement-learning problem sitting on top of N allocators with provable per-op latency. The allocator is deterministic; the dispatcher is what predicts. That co-design is the part I'm actually chasing.

Try it

MIT, pure C99, no dependencies, Linux + macOS.

git clone https://github.com/anastassow/PMAD.git
cd PMAD/benchmarks/v2
make && make test     # 19/19 correctness checks first
./run_all.sh          # full head-to-head

Every number is reproducible — raw samples committed, full methodology in the repo.

https://github.com/anastassow/PMAD

If you ship real-time systems on Linux and have opinions on what this benchmark should look like with cores isolated — that's the feedback I want. Tell me where I'm wrong.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:21

↗

As announced in this blog post on June 18, 2026, Gemini CLI and Gemini Code Assist IDE extensions will stop serving requests for Google AI Pro and Ultra, as well as those using it free of charge using Gemini Code Assist for individuals. Google is unifying its AI terminal...

As announced in this blog post on June 18, 2026, Gemini CLI and Gemini Code Assist IDE extensions will stop serving requests for Google AI Pro and Ultra, as well as those using it free of charge using Gemini Code Assist for individuals. Google is unifying its AI terminal tools by transitioning the community-focused Gemini CLI into Antigravity CLI, a new agent-first platform built for complex, multi-agent workflows.

With this transition timeline in place, development teams relying on Gemini CLI for repository management and automated tasks must establish a migration path. In this post, I will show you how to transition seamlessly by building an automated "first-pass" pull request reviewer using the Google Antigravity SDK and the run-agy-sdk composite GitHub Action.

The orchestration tax

The approach I am proposing also solves another pressing issue for modern engineering teams: cognitive overload. As Addy Osmani recently pointed out, there is an orchestration tax to using AI for coding. The time developers save generating code is often pushed onto reviewers as large, complex PRs, causing context switching and cognitive fatigue.

By offloading the tedious "first pass" search to an Antigravity agent, human reviewers can mitigate this tax and focus on high-level architecture and safeguarding quality.

Why we need automated agentic code reviews

AI-generated code can be deceptively good. It is often clean, well-documented, and syntactically correct. This makes it harder for human reviewers to spot subtle logical bugs or security vulnerabilities that might not be immediately obvious.

In a large codebase, manually verifying every change is simply not feasible. This is why we need autonomous agents that can step into the codebase and analyze it from a fresh perspective.

But if a developer used an LLM to generate the code, how can we trust another AI to find the bugs? The answer lies in the agent architecture and context separation.

Developers might write code using any tool — whether it's CLI, an IDE extension, or various models like Gemini 3.5 Flash or Gemini 3.1 Pro. The reviewer, however, is a managed Antigravity Agent running via a separate SDK integration. This agent has a specialized, low-freedom persona and strict system instructions that force it to act as an adversarial code auditor rather than a developer. Furthermore, it operates in an isolated environment. Because it has a different system prompt, safety guardrails, and context boundaries, the agent reviews the changes with a completely fresh perspective, catching logical bugs and vulnerabilities that the original generator might miss.

To demonstrate it in practice I created an agentic review pipeline, which:

Leverages a managed Antigravity Agent configured via the SDK to review the code. The agent uses advanced reasoning to explore files and verify logic under strict guidelines.
Runs reviews inside isolated workspaces or sandboxes with custom policies to prevent shell or arbitrary code execution risks.
Enables the agent to use the GitHub MCP server to interact directly with the environment to write pull request comments and reviews.
Avoids using the synchronize trigger in pull request workflows to prevent redundant review runs and endless loops. Instead, runs reviews on opened and reopened events, and triggers subsequent passes manually by posting a @agy /review comment on the PR.

You can find the code at run-agy-sdk.

What is run-agy-sdk?

The run-agy-sdk is a composite GitHub Action that runs the Google Antigravity SDK (google-antigravity) directly on the GitHub Actions host runner.

Why run on the host instead of a container?

By running directly on the host, the Antigravity SDK has access to the host's Docker daemon. This allows the SDK to spawn Docker-based MCP servers (like the GitHub MCP server) to read files, run tests, and post reviews.

Sub-containers should ideally run with restricted network access and read-only filesystems where possible to prevent an LLM from being tricked into executing arbitrary destructive commands. The limited set of permissions is handled in the GitHub Action configuration (see here). Whereas the Antigravity agent has a limited number of tools it can use from GitHub MCP (see here).

Moreover the workflow is explicitly protected from running automatically on forks, preventing unauthorized code execution. The automated review job will only run if the pull request originates from the same repository (see here). On-demand reviews triggered by commenting @agy /review are restricted so that they can only be initiated by maintainers (see here).

Demonstration walkthrough

The demo below shows the action triggered by a new PR:

Implementation: How to install the action in your repo

Let's walk through the setup process step-by-step.

Step 1: Add your API key to GitHub secrets

The action requires a Google Gemini or Antigravity API key to authenticate language model interactions.

Generate your API key.
Navigate to your target GitHub repository and go to Settings > Secrets and variables > Actions.
Create a new Repository Secret named ANTIGRAVITY_API_KEY and paste your API key as the value.

Step 2: Configure the GitHub Actions workflow

Add a new file in your repository at .github/workflows/antigravity-review.yml and add the following configuration:

name: '🔎 Antigravity PR Review'

on:
  pull_request:
    types: [opened, reopened]
  workflow_dispatch:

concurrency:
  group: '${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}'
  cancel-in-progress: true

jobs:
  antigravity-review:
    runs-on: 'ubuntu-latest'
    timeout-minutes: 20

    permissions:
      contents: 'read'
      pull-requests: 'write'
      issues: 'write'

    steps:
      - name: 'Checkout Repository'
        uses: 'actions/checkout@v6'
        with:
          persist-credentials: false

      - name: 'Run Antigravity PR Review'
        uses: 'rsamborski/run-agy-sdk@main'
        id: 'agy_pr_review'
        with:
          api-key: '${{ secrets.ANTIGRAVITY_API_KEY }}'
          github-token: '${{ secrets.GITHUB_TOKEN }}'
          mode: 'review'
          prompt: '/antigravity-review'
          trust-workspace: 'true'
          sandbox-profile: 'true'

Pro Tip: Pin the action version to a specific commit SHA (e.g., rsamborski/run-agy-sdk@<commit-sha>) rather than using @main. This prevents unexpected breaks from upstream updates.

While you can reference run-agy-sdk directly in your workflows, its real power lies in using it as a blueprint. I encourage you to fork the repository and use it as a template to build your own custom, agentic GitHub Actions. By modifying the safety policies, custom tools, or prompts in run_agent.py, you can tailor the agent's review behavior to your team's specific codebase, style guidelines, and compliance rules.

For a full workflow template supporting both automated PR reviews and comment-triggered reviews, refer to the workflows folder in the repository.

Conclusions

Automating code reviews is a necessity as AI-generated code volumes increase. By using run-agy-sdk, you can run the Antigravity SDK to review PRs automatically and shift more of the burden of code quality assurance away from human reviewers.

Access the full source code in the GitHub Repository.
Read the documentation to customize the prompts and mode.
Feel free to fork the repository and build your own automation.

Acknowledgments

This project was inspired by the run-gemini-cli action, while shifting to the recently released Antigravity SDK. It is a personal sample implementation of how to run the Antigravity SDK in a GitHub Action, and is not an officially supported Google product.

Let’s connect!

I’d love to hear how you’re using Antigravity for your agentic workflows. Are you building automated code review loops or keeping a tighter leash on your agents?

Connect with me on LinkedIn
Follow me on X
Catch me on Bluesky

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:19

↗

The End of Traditional Coding? How AI Coding Agents Are Transforming Software Development in 2026 The software development industry is experiencing one of the biggest transformations in its history. For decades, programming was primarily about developers manually writing...

The End of Traditional Coding? How AI Coding Agents Are Transforming Software Development in 2026

The software development industry is experiencing one of the biggest transformations in its history. For decades, programming was primarily about developers manually writing code, debugging applications, and maintaining software systems.

In 2026, that model is rapidly changing.

The rise of AI coding agents is creating a new era where developers increasingly focus on defining objectives while autonomous systems generate, modify, test, and even deploy code.

Companies such as GitHub, Microsoft, OpenAI, Anthropic, and emerging startups are investing billions into technologies designed to automate large portions of software engineering.

What Exactly Is an AI Coding Agent?

An AI coding agent goes far beyond traditional code completion tools.

Unlike autocomplete systems that merely suggest the next line of code, modern coding agents can:

Analyze entire repositories
Create implementation plans
Write production-ready code
Generate tests automatically
Fix bugs independently
Review pull requests
Refactor large codebases
Deploy applications

GitHub's latest Copilot initiatives are heavily focused on agent-based development, allowing developers to assign issues directly to AI systems that work autonomously in the background and submit pull requests for review. This marks a significant evolution from AI assistance to AI execution.

Why Developers Are Paying Attention

The benefits are difficult to ignore.

Recent industry developments show that organizations are increasingly adopting AI-powered workflows because they dramatically reduce repetitive engineering tasks.

Developers can spend less time fixing boilerplate code and more time focusing on architecture, product decisions, and business logic.

The result is a fundamental shift in how engineering teams operate.

The New Programming Workflow

Traditional software development:

Write code
Debug manually
Write tests
Create pull requests
Deploy

Modern AI-assisted development:

Define requirements
Assign tasks to agents
Review generated work
Approve deployment

The developer increasingly becomes a supervisor rather than a code producer.

Major Industry Players Driving the Shift

Company	Focus	AI Strategy
GitHub	Developer Platform	Autonomous coding agents
Microsoft	Operating Systems & Cloud	AI-first developer ecosystem
OpenAI	Foundation Models	Agent-based software creation
Anthropic	AI Systems	Advanced coding workflows
Nvidia	Infrastructure	AI compute for agent workloads

GitHub's Infrastructure Challenge

The explosive growth of AI-generated software is creating infrastructure challenges that few predicted.

Reports indicate GitHub has experienced unprecedented demand due to AI coding activity, forcing significant infrastructure expansion and even external cloud capacity support to handle the surge in automated development workloads. This illustrates just how quickly AI-assisted software engineering is growing.

Microsoft's Vision: Windows as an AI Operating System

Microsoft's Build 2026 announcements revealed a broader vision for the future.

Rather than treating AI as another software feature, Microsoft is positioning Windows as a platform where AI agents operate as first-class citizens.

The company is introducing new tools, agent frameworks, secure execution environments, and developer experiences designed specifically for autonomous software systems.

This could fundamentally change how applications are built and maintained over the next decade.

What Tasks Are Already Being Automated?

Today's coding agents can already handle:

Bug fixing
Code reviews
Unit testing
Documentation generation
Dependency updates
Code migration
Refactoring
Repository analysis
Pull request generation

Some organizations are already reporting dramatic productivity gains by integrating these capabilities into daily workflows.

The Skills That Will Matter Most

As AI agents become more capable, the most valuable developer skills are shifting.

Traditional Focus	Future Focus
Syntax Memorization	System Design
Manual Coding	Agent Management
Boilerplate Creation	Architecture
Debugging Line-by-Line	Validation & Review
Implementation	Problem Solving

The ability to communicate effectively with AI systems may become as important as knowledge of programming languages.

The Challenges Nobody Talks About

Despite the excitement, significant challenges remain.

Security vulnerabilities introduced by generated code
Overreliance on automation
Code quality consistency
Hallucinated implementations
Licensing concerns
Infrastructure costs
Governance and compliance

Organizations must establish strong review processes to ensure that autonomous systems remain reliable and secure.

Could AI Replace Developers?

This is the question everyone asks.

The evidence so far suggests that AI is more likely to transform software engineering than eliminate it.

Developers who embrace AI tools are becoming significantly more productive, while those who ignore them risk falling behind.

The role is evolving rather than disappearing.

The Future of Programming

Software engineering is entering a new phase where humans and AI collaborate at unprecedented levels.

The future developer may spend less time writing code and more time designing systems, validating outputs, defining business requirements, and orchestrating teams of AI agents.

Programming is not dying.

It is evolving into something entirely new.

Final Thoughts

The AI coding revolution is no longer a prediction. It is happening right now.

Whether you're a junior developer, a senior engineer, or a technology leader, understanding AI agents is becoming essential.

The next generation of software will likely be built not only by humans, but by intelligent systems working alongside them.

The biggest question is no longer whether AI will change programming.

The question is how quickly developers will adapt to the change.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:17

↗

Introduction During my freelance work on the Teletype platform, I regularly worked with repetitive browser-based workflows involving form processing, task updates, and continuous data entry operations. While the tasks themselves were straightforward, the high volume and...

Introduction

During my freelance work on the Teletype platform, I regularly worked with repetitive browser-based workflows involving form processing, task updates, and continuous data entry operations. While the tasks themselves were straightforward, the high volume and repetitive nature made the process time-consuming and prone to manual inefficiencies.

To improve productivity and streamline workflow execution, I developed a Python automation solution using Selenium WebDriver. The project focused on automating repetitive browser interactions while maintaining reliability, consistency, and scalability.

This project became one of my most valuable real-world experiences in Python development, automation engineering, testing, debugging, and workflow optimization.

Understanding the Teletype Workflow

Before writing a single line of code, I spent significant time studying the Teletype workflow structure.

The platform workflow consisted of:

Task allocation and assignment
Form-based data processing
Continuous task refresh cycles
User input validation
Task submission and completion
Performance and quality monitoring

Rather than automating blindly, I first analyzed how tasks appeared, how workflows progressed, how validations occurred, and how user interactions affected overall task completion.

This understanding became the foundation of the automation system.

Problem Statement

The primary challenges included:

Repetitive browser interactions
Large volumes of similar tasks
Continuous monitoring requirements
Manual form processing
Long execution periods
Maintaining consistency and accuracy

Performing these operations manually consumed significant time and reduced overall productivity.

Analysis Before Automation

A major part of the project involved workflow analysis and extensive testing.

I observed task patterns, monitored workflow behavior, and evaluated how different user actions affected processing outcomes. Multiple iterations of testing were performed to understand execution patterns, optimize workflow handling, and improve reliability.

Instead of focusing solely on automation speed, I focused on maintaining a balance between efficiency, consistency, and workflow quality.

This phase taught me the importance of understanding business processes before attempting to automate them.

Technology Stack

The project was built using:

Python
Selenium WebDriver
ChromeDriver
Environment Variables
Object-Oriented Programming Principles

Python provided flexibility and rapid development capabilities, while Selenium enabled browser-level automation and dynamic interaction with web elements.

System Architecture

The automation solution was designed with multiple components:

Authentication Module

Secure credential handling
Automated login assistance
Session management

Workflow Monitoring Module

Continuous task detection
Dynamic page monitoring
Refresh cycle handling

Form Processing Engine

Automated field detection
Data entry processing
Submission handling

Error Recovery System

Exception handling
Retry mechanisms
Workflow recovery

Performance Tracking

Task counters
Execution monitoring
Activity logging

This modular approach improved maintainability and scalability.

Key Features

Dynamic Element Detection

The automation continuously monitored page changes and interacted with elements only when they became available.

Automated Workflow Execution

The system reduced repetitive manual interactions by automating task processing and submission workflows.

Real-Time Monitoring

Execution logs provided visibility into workflow status, completed actions, and overall system behavior.

Long-Running Stability

The automation was designed to operate reliably for extended periods with minimal supervision.

Error Handling

Robust exception handling ensured smooth execution even when unexpected situations occurred.

Testing and Optimization

Extensive testing played a critical role throughout development.

Testing focused on:

Workflow reliability
Browser stability
Response timing
Dynamic content handling
Error recovery
Long-duration execution

Several iterations of optimization were performed to improve consistency and overall system performance.

This phase significantly improved my debugging and problem-solving skills.

Challenges Faced

Dynamic Web Pages

Handling frequently changing web elements required careful use of explicit waits and conditional logic.

Workflow Variability

The platform workflow could change depending on task state, requiring adaptive automation logic.

Long-Term Stability

Ensuring stable execution during prolonged sessions required extensive testing and optimization.

Reliability Over Speed

A key lesson was prioritizing reliability and consistency over simply maximizing execution speed.

Results

The automation solution successfully streamlined repetitive workflow operations and significantly reduced manual effort.

Key outcomes included:

Improved workflow efficiency
Reduced repetitive manual work
Increased consistency in task execution
Enhanced productivity through automation
Practical experience with production-style automation systems

The project demonstrated how workflow analysis, software engineering, and automation can be combined to solve real-world operational challenges.

Skills Gained

Throughout the project, I strengthened my skills in:

Python Programming
Selenium WebDriver
Browser Automation
Process Automation
Object-Oriented Programming (OOP)
Debugging and Testing
Workflow Analysis
Software Development
Problem Solving
Performance Optimization

Future Improvements

Potential future enhancements include:

Database integration
Advanced reporting dashboards
Configuration-based workflow management
Cloud deployment
Automated analytics and monitoring

Conclusion

Building this Teletype automation solution was a valuable real-world engineering experience that combined workflow analysis, Python development, testing, and browser automation.

The project reinforced an important lesson: successful automation is not just about writing code—it begins with understanding the workflow, analyzing patterns, testing extensively, and building reliable systems that solve practical problems efficiently.

Developed by Varanasi Teja
Integrated M.Tech CSE (Data Science), VIT Vellore

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:15

↗

I'm a developer-turned-founder. When I moved from building software to building a revenue consultancy, I kept running into a problem my developer brain couldn't ignore. Software teams have: Version control (git) Audit trails (commit history) Rollback (git revert) Diff views...

I'm a developer-turned-founder. When I moved from building software to building a revenue consultancy, I kept running into a problem my developer brain couldn't ignore.

Software teams have:

Version control (git)
Audit trails (commit history)
Rollback (git revert)
Diff views (what changed between releases)
Exit criteria (CI/CD gates before deploy)

GTM teams have:

A spreadsheet someone updates randomly
Memory (until someone leaves)
Vibes

It bothered me. So I built the thing I wanted.

What is GTM Version Control?

GTM version control is the practice of treating your go-to-market strategy the same way you treat code: every change is documented with intent, every modification is reversible, and every decision creates a traceable audit trail.

The core model:

Revenue = Traffic × Conversion × Price × (1/Churn)

Every strategy decision touches one of these variables. GTM version control means you always know:

What changed (the diff)
When it changed (the commit timestamp)
Why it changed (the commit message / intent)
What it affected (impact surface tracking)

How Artefact CRO Implements This

I built Artefact CRO as a Revenue Operating System that brings this model to HubSpot-connected B2B teams.

Under the hood, pipeline stages function as API boundaries with exit criteria that work like CI gates. A deal can't move to the next stage without satisfying the criteria. Every strategic change to those criteria creates a versioned commit.

6 signal types are auto-classified:

MOMENTUM_SHIFT      → velocity change in a specific stage
STALL_PATTERN       → deals clustering with no movement
CONVERSION_ANOMALY  → unexpected rate change at a stage gate
ENGAGEMENT_SPIKE    → unusual activity volume
RISK_INDICATOR      → negative signal pattern
EXPANSION_SIGNAL    → positive signal in existing accounts

ARIA - our AI agent: monitors these signals and surfaces pattern changes before they become visible in lagging indicators like closed revenue.

Why This Matters for Builders

If you're building SaaS for sales, marketing, or RevOps, the developer mental model is your unfair advantage when talking to technical stakeholders who often block or champion GTM tools.

The companies that get this right treat their revenue process like production infrastructure: observable, auditable, reversible.

That's the future of GTM tooling.

Would love to discuss in the comments — has anyone else built tooling around this concept?

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:14

↗

There is a quiet assumption running through most conversations about AI security: that the danger is coming, but it isn't here yet. That assumption is mostly right. What fewer people acknowledge is why. Today's AI agents are not safe because anyone made them safe. They are...

There is a quiet assumption running through most conversations about AI security: that the danger is coming, but it isn't here yet. That assumption is mostly right. What fewer people acknowledge is why.

Today's AI agents are not safe because anyone made them safe. They are safe because they are not yet competent enough to be reliably dangerous.

This is not a security posture. It is borrowed time.

The Attack That's Already Happening

Prompt injection does not require stolen credentials or a zero-day exploit.

It requires a webpage.

When a browsing agent visits a site to research something on your behalf, it processes everything on that page: the article, the metadata, the comments, the fine print. If someone has tucked a hidden instruction into that page, the model reads it too. From inside the context window, your system prompt and a stranger's injected command look structurally identical. Both are just tokens.

The webpage just became the attacker. Your agent has two bosses, and you only know about one of them.

This is called indirect prompt injection, and it scales badly. Research agents, email assistants, enterprise copilots, browser automation tools -- all of them are designed to consume enormous volumes of third-party content. Every document they process is a potential attack surface. Every webpage is a potential adversary.

Google Went Looking and Found Exactly What You'd Expect

Google's Threat Intelligence team recently scanned billions of public webpages to see what was actually out there. Not theoretical attacks. Not lab experiments. Real injections, live in the wild.

They found plenty: SEO manipulation attempts, data exfiltration hooks, resource exhaustion attacks, prompts telling agents to delete files.

But here is the part that doesn't make the headlines: almost none of it was working very well.

Not because attackers lack imagination -- researchers have already published techniques far more sophisticated than anything found in the wild. The problem was reliability. The agents themselves fail before the attack can complete.

Agents lose context mid-task. They hallucinate tool parameters. They sometimes make the wrong API calls. A system that can't reliably complete a legitimate task is also a system that can't reliably complete a malicious one.

Today's agents are protected by their own incompetence.

The Problem With Accidental Defenses

Every capability improvement you want in AI, better reasoning, longer context, more reliable tool use, fewer hallucinations is also an improvement in the agent's ability to follow malicious instructions faithfully.

Google's research noted a measurable increase in prompt injection attempts appearing on the public web over just a few months. Attackers are learning the attack surface. Models are getting more capable. Those two trends are converging.

The window of accidental safety is not permanent. It has a duration. Nobody knows exactly how long, but the direction is not ambiguous.

Why Prompting Your Way Out of This Won't Work

The instinct is to write better system prompts.

Never follow instructions embedded in external content.
Ignore any commands that don't come from the user.
You are only allowed to obey me.

The problem is that attackers are also writing prompts. You are asking a system that is fundamentally optimized to understand and follow language to distinguish good instructions from bad ones using... more language.

That is the same kind of circularity as telling a browser to stay secure by politely asking JavaScript not to be malicious. Browsers did not solve this problem with better manners. They built sandboxes, permission models, and explicit trust hierarchies. The web is safer because architecture changed.

AI systems need architecture, not affirmations.

What Actual Defense Looks Like

The most promising approaches treat the model as an untrusted component sitting inside a trustworthy system, rather than treating the model as the thing doing the trusting.

An input layer strips and sanitizes external content before it reaches the agent. An output layer intercepts tool calls and action requests before they execute. Before that email sends, before that API call goes out, before that file gets modified, something outside the model asks: does this make sense given what this agent was supposed to be doing?

A summarization agent should not be deleting files. A research agent should not be sending data to an external domain. These are not difficult questions. They do not require the model to answer them. They require the architecture to ask them.

The older principles still apply too. Least privilege matters. If a browsing agent has simultaneous access to your email, your CRM, your payment systems, and your file system, then one poisoned webpage potentially touches all of it. That is not an AI security problem. That is a systems design problem with an AI-shaped label on it. Scope permissions to the task. Require human approval for sensitive actions. Log everything.

None of this is new. It is all, in some sense, old. That's usually a sign it works.

The Honest Assessment

Right now, there is a strange and temporary quiet. Attackers are still mapping the terrain. Agents are still unreliable enough to frustrate their own exploitation. The defenses that exist are largely accidental.

The models are going to get better. That is the entire point of the field. The only real question is whether the security architecture improves in parallel or scrambles to catch up afterward.

Source: AI threats in the wild

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:11

↗

The Quest Begins (The “Why”) I still remember the first time I walked into a tech interview feeling like Luke Skywalker staring down the Death Star trench—armed with a lightsaber of algorithms but totally clueless about the “soft‑skill” barrage waiting around the corner. The...

The Quest Begins (The “Why”)

I still remember the first time I walked into a tech interview feeling like Luke Skywalker staring down the Death Star trench—armed with a lightsaber of algorithms but totally clueless about the “soft‑skill” barrage waiting around the corner. The interviewer leaned forward, smiled, and hit me with:

“Tell me about a time you had to convince a skeptical stakeholder to adopt a new approach.”

My brain went into a loop like Neo dodging bullets in The Matrix—I saw the code, but I couldn’t translate it into a story. I babbled about a project, dropped some jargon, and finished with a weak “and it worked out fine.” The interviewer’s eyes glazed over, and I felt the Force slip away.

That moment was my “aha!”—I realized that knowing how to build a distributed system matters, but if you can’t narrate why you built it the way you did, the interviewers can’t see the hero behind the code. I needed a repeatable spell, a technique that turned my messy recollections into crisp, compelling legends. Enter the STAR method.

The Revelation (The Insight)

STAR isn’t a new framework; it’s the ancient Jedi code for storytelling: Situation, Task, Action, Result. The magic lies in exact wording—you don’t just think “I did X, then Y happened”; you script each piece so the listener can’t miss the hero’s journey.

Here’s the exact template I now use (feel free to steal it):

Situation:  Set the scene in one sentence – who, what, where, when.
Task:       State your specific responsibility – what you were *expected* to do.
Action:     Describe the steps YOU took – focus on verbs, not “we”.
Result:     Quantify the outcome – numbers, impact, or a clear lesson.

The trick? Keep each block to one or two sentences. Anything longer turns your answer into a monologue that loses the audience’s attention—think of it like a trailer that gives away the whole movie.

Wielding the Power (Code & Examples)

🎯 Example 1 – “Tell me about a time you dealt with a difficult coworker.”

Before (the struggle):

“Well, we had this guy on the team who was always negative, and I tried to talk to him, and eventually we got along better, and the project finished.”

Why it fails: Vague situation, no clear task, the action is a passive “we tried,” and the result is flimsy (“got along better”).

After (the victory):

Situation:  During the sprint planning for our payment‑gateway refactor, a senior backend engineer repeatedly dismissed my API design suggestions in meetings.
Task:       As the feature lead, I needed to ensure the design decisions were technically sound and that the team stayed aligned so we could hit the two‑week deadline.
Action:     I scheduled a 15‑minute one‑on‑one, asked open‑ended questions to understand his concerns, presented data from our load‑tests showing my design reduced latency by 22%, and agreed to prototype both approaches in a spike.
Result:     The spike proved my design saved ~30 ms per transaction; we adopted it, the feature shipped on schedule, and the engineer later thanked me for clearing up the confusion, improving our sprint velocity by 15% in the next quarter.

Notice the exact wording: each block is a single, punchy sentence. I used strong verbs (“scheduled,” “asked,” “presented,” “agreed,” “proved”), quantified the impact (“22% latency reduction,” “30 ms per transaction,” “15% velocity boost”), and kept the focus on my contribution.

🎯 Example 2 – “Give an example of a project you led from start to finish.”

Before (the struggle):

“I led a project to build a dashboard. We used React and Node, and it turned out good.”

Why it fails: No context, the task is implied, the action is a vague “we used,” and the result is missing.

After (the victory):

Situation:  Our customer‑support team was spending 10 hours a week manually exporting CSV files to track ticket trends.
Task:       I was tasked with delivering an self‑service dashboard that would cut that effort by at least 50 % within six weeks.
Action:     I gathered requirements through three stakeholder interviews, designed the data model in PostgreSQL, built a React‑Redux frontend with Chart.js visualizations, set up CI/CD pipelines on GitHub Actions, and ran two usability‑testing iterations with support agents.
Result:     The dashboard reduced manual reporting time from 10 hours to 3 hours per week—a 70 % savings—and was adopted by all three support shifts, leading to faster issue identification and a 12 % drop in repeat tickets the following month.

Again, each block is tight, each verb is active, and the numbers make the impact undeniable.

🚫 Traps to Avoid (The “Trapdoors”)

The “We” Trap – Saying “We built…”, “We decided…”, etc. Interviewers want to know your role. Flip it: “I built…”, “I decided…”.
The Vague Result Trap – Saying “It went well” or “Everyone liked it.” Attach a metric: time saved, revenue gained, defect reduction, NPS lift, etc.
The Rambling Trap – Going beyond two sentences per block. If you find yourself adding filler, pause, cut the fluff, and re‑state the core point.
The Missing Task Trap – Jumping straight to action without stating what you were supposed to do. The task is the compass that shows why your actions mattered.

If you catch yourself slipping into any of these, hit the pause button, reframe, and deliver the STAR again in the tightened format.

Why This New Power Matters

Mastering STAR turned my interview anxiety into confidence. I stopped treating behavioral questions as an afterthought and started treating them like a boss level where my story is the weapon. The moment I began using the exact wording above, I saw interviewers lean in, nod, and ask follow‑ups that let me showcase depth—not just surface‑level buzzwords.

More importantly, the skill translates beyond interviews. When you practice distilling experiences into Situation‑Task‑Action‑Result, you become better at writing clear documentation, giving concise stand‑up updates, and presenting ideas to stakeholders. It’s a super‑power that compounds every time you speak.

Your Next Quest (Actionable Step)

Grab a notebook or a blank doc right now. Pick one recent work episode—maybe a bug you squashed, a feature you shipped, or a conflict you resolved. Write out the STAR using the exact template, aiming for one sentence per block. Then, read it aloud. If it feels longer than two sentences per part, trim. Do this three times this week with different stories, and record yourself on your phone. Play it back—does it sound like a hero’s journey or a boring lecture?

When you feel comfortable, try it out in a mock interview with a friend or on a platform like Pramp. Notice the difference in their eyes when you hit the Result with a concrete number.

Challenge: Comment below with your own STAR sentence for the “Situation” block of a time you turned a failing test suite into a green build. Let’s see who can craft the most punchy, movie‑worthy opening line!

May the Force be with you, and may your stories always land like a perfectly timed lightsaber strike. 🚀

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:09

↗

When I led self-serve at a usage-based data company, one of the most common feature requests was credit limits per API Key. People wanted to hand a key to a script, a teammate, or now an AI agent, and know it couldn't run up the whole bill. We get the same request at my...

When I led self-serve at a usage-based data company, one of the most common feature requests was credit limits per API Key. People wanted to hand a key to a script, a teammate, or now an AI agent, and know it couldn't run up the whole bill. We get the same request at my current startup, Tanso.

Account-level and user-level limits exist — That's what enterprise quota systems are for. But they're heavy. For a startup there wasn't a simple drop-in. So I wrote one.

agentkey does four things:

Cap what a key can spend: A budget per key, per day or month
Scope what it can do: Least privilege per key
Set when it expires: Short-lived by default if you want
Record which human authorized it: Delegation you can audit

The gap

AI agents made this urgent. An agent spends on its own — a loop or a bad prompt can burn a month's budget before anyone looks at a dashboard. And here's the part most tools miss: scoped keys tell you what an agent can do, not how much it can spend. LLM gateways cap spend. Identity platforms scope keys. Neither does both at the key level. agentkey does.

How it works

It's not a new auth system. It adds a few columns to your existing Postgres keys table and gives you a small API.

npm install @katrinalaszlo/agentkey

Create a key that's scoped, budgeted, and expiring:

import { AgentKey } from '@katrinalaszlo/agentkey';

const ak = new AgentKey({ pool }); // your pg Pool

const key = await ak.create({
  accountId: 'acct_123',
  scopes: ['proxy.chat'],
  budgetCents: 5000,        // $50 cap
  budgetPeriod: 'month',
  expiresIn: '7d',
  delegatedBy: 'user_456',  // the human who authorized this agent
});

Validate on each request, and track spend after a call:

const result = await ak.validate(key.key);
// { valid: true, scopes: ['proxy.chat'], budgetRemainingCents: 5000, ... }

await ak.trackUsage(key.key, { costCents: 15 }); // after an LLM call

Budget enforcement is atomic, so concurrent agent calls can't race past the cap — which matters, because agents fire requests in parallel. There's also Express middleware if you want it:

app.post('/api/proxy', agentKeyMiddleware(ak, { scope: 'proxy.chat' }), handler);

What it's not

It's small and focused, extracted from a real production key system, MIT-licensed. It isn't trying to be Clerk or Auth0. If you already have a keys table and you want per-key spend caps without building a quota system, it's a few columns and a function call.

npm: @katrinalaszlo/agentkey · GitHub: katrinalaszlo/agentkey

DEV Community dev.to community dev-to software-dev technology 2026-06-18 19:08

↗

Originally published at woitzik.dev It started with Paperless-ngx crashing. It ended with my control-plane node sitting at a load average of 90, CoreDNS generating 1.2 million DNS queries per day, and worker nodes reporting 3.8 GiB of allocatable memory instead of the 16 GiB...

Originally published at woitzik.dev

It started with Paperless-ngx crashing.

It ended with my control-plane node sitting at a load average of 90, CoreDNS generating 1.2 million DNS queries per day, and worker nodes reporting 3.8 GiB of allocatable memory instead of the 16 GiB they actually had.

The root cause of all of it: a single 1 GiB memory limit set three months earlier without much thought.

This is the full post-mortem — not the sanitized version where everything was obvious in hindsight, but the actual sequence of failures and how I traced each one back to its cause.

View the complete homelab infrastructure source on GitHub 🐙

The Setup

Three-node k3s cluster running on Proxmox VMs (VLAN 20, server subnet):

vm-srv-k3s-11 — control-plane, 4 cores, 12 GiB dedicated
vm-srv-k3s-12 — worker, 4 cores, up to 16 GiB (balloon)
vm-srv-k3s-13 — worker, 4 cores, up to 16 GiB (balloon)

Apps namespace runs about 20 workloads: Nextcloud, Authelia, Paperless-ngx, Jellyfin, Home Assistant, Gitea, Mealie, and more. GitOps via ArgoCD; Longhorn for distributed storage.

Failure 1: Paperless OOMKilled 16 Times in 5 Hours

Paperless-ngx uses Tesseract for OCR and Apache Tika for document ingestion. When a batch of documents hits at once — invoice exports, scanned PDFs — both workers burst memory hard and fast.

The deployment had this:

resources:
  requests:
    memory: 512Mi
    cpu: 250m
  limits:
    memory: 1Gi
    cpu: 500m

That 1 GiB ceiling is too low. When Tesseract processes a high-resolution scanned document, it easily needs 2–3 GiB. The kernel OOM killer terminates the container every time. Kubernetes restarts it. The next document in the queue triggers another OOM. Repeat sixteen times.

Fix: raised limits and reduced concurrency to stay under the higher ceiling:

resources:
  requests:
    memory: 1Gi
    cpu: 500m
  limits:
    memory: 3Gi
    cpu: 2000m
env:
  - name: PAPERLESS_TASK_WORKERS
    value: "2"
  - name: PAPERLESS_THREADS_PER_WORKER
    value: "2"

But this didn't explain the load average of 90.

Failure 2: The Control-Plane Was Scheduling App Workloads

When I checked where Paperless was running, it was on vm-srv-k3s-11 — the control-plane.

In a standard k3s setup, the control-plane has a node-role.kubernetes.io/control-plane:NoSchedule taint. User workloads shouldn't land there. But somewhere along the way, the Paperless deployment had picked up a toleration:

tolerations:
  - operator: Exists

operator: Exists with no key or effect matches every taint on every node, including NoSchedule on the control-plane. The pod scheduled there, and every OOMKill → restart cycle added another spike of CPU load to a node already running etcd, the k3s API server, CoreDNS, kube-proxy, and Longhorn replica management.

The fix was to remove the blanket toleration entirely. The Paperless deployment doesn't need to run on the control-plane.

With the toleration removed and the memory limit raised, load on vm-srv-k3s-11 dropped from 90 to 1.04 immediately. But two more problems had already developed in the background.

Failure 3: CoreDNS Was Generating 1.2 Million Queries Per Day

During the OOM cascade, I noticed AdGuard Home (running on two Raspberry Pi nodes in HA via Keepalived) was under unusually high load. I checked the query log: 1.2 million DNS queries in 24 hours for a three-node homelab cluster.

The culprit: CoreDNS default cache TTL.

CoreDNS ships with a 30-second cache TTL. Every pod that makes a DNS lookup for a Kubernetes service gets an answer that expires in 30 seconds. In a healthy cluster that's fine. During an OOM cascade — where pods are restarting constantly, new IPs are being assigned, and connection state is unstable — the DNS query rate explodes. Pods that are restarting frequently keep hammering CoreDNS for the same records.

The fix was a one-line patch to the CoreDNS ConfigMap:

kubectl patch configmap coredns -n kube-system --patch '
data:
  Corefile: |
    .:53 {
      errors
      health
      ready
      kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
      }
      prometheus :9153
      forward . /etc/resolv.conf
      cache 300
      loop
      reload
      loadbalance
    }
'

Raising the cache TTL from 30 to 300 seconds reduced the upstream query volume by roughly 10x. I also updated AdGuard Home (via Ansible) to enable optimistic caching and increase its cache size:

# ansible/roles/adguard/templates/AdGuardHome.yaml.j2
dns:
  cache_size: 67108864  # 64 MiB
  cache_optimistic: true

cache_optimistic: true means AdGuard returns the cached (possibly stale) answer immediately while refreshing in the background — eliminating the latency spike on cache expiry. Combined, these two changes brought the daily query count down to ~120k.

Failure 4: Worker Nodes Reporting Wrong Allocatable Memory

While fixing the above, I noticed something odd in kubectl describe node vm-srv-k3s-12:

Capacity:
  cpu:     4
  memory:  3981384Ki  ← ~3.8 GiB
Allocatable:
  cpu:     4
  memory:  3878584Ki  ← ~3.7 GiB

The VM was allocated 16 GiB in Proxmox. Why was kubelet reporting 3.8 GiB?

The answer is Proxmox balloon memory.

Balloon memory in Proxmox works like this: you set a dedicated (maximum) and a floating (minimum) value. When the host is under memory pressure, Proxmox can shrink the guest down to the floating minimum. The key detail: kubelet reads available memory at startup time. If kubelet starts when the VM has been ballooned down to its minimum, that's what it registers as the node's capacity — and it doesn't update that value dynamically.

My Terraform config had this:

memory {
  dedicated = 16384  # 16 GiB max
  floating  = 4096   # 4 GiB min ← too low
}

The workers had been under pressure during the OOM cascade, Proxmox had ballooned them down to 4 GiB, kubelet restarted and registered 3.8 GiB (4096 MB minus kernel + system overhead), and that's what Kubernetes thought the nodes had.

The fix: raise the minimum balloon to ensure kubelet always sees adequate memory:

memory {
  dedicated = 16384
  floating  = 8192   # 8 GiB min — safe floor for kubelet registration
}

After restarting k3s-agent on both workers, capacity showed correctly:

Capacity:
  memory: 16383272Ki  # 16 GiB

The Full Cascade, Traced

1Gi Paperless limit + Exists toleration
        ↓
OOMKill × 16 on the control-plane
        ↓
k3s-11 load average: 90
(etcd + API server + OCR workers + Longhorn replicas all competing)
        ↓
Pods restarting constantly → high DNS churn
        ↓
CoreDNS 30s TTL → 1.2M queries/day → AdGuard overload
        ↓
Balloon minimum 4096 MB → kubelet restart → 3.8 GiB registered
        ↓
Scheduler thinks workers have less capacity → over-schedules control-plane
        ↓
(back to top)

Each failure made the next one worse. Raising the memory limit without fixing the toleration would have helped Paperless but left the control-plane overloaded. Fixing the toleration without fixing the balloon minimum would have moved the problem to a worker node with 3.8 GiB of visible capacity. The DNS fix was independent but would have eventually caused its own stability issues at scale.

What Would Have Caught This Earlier

A few things would have surfaced these issues before they compounded:

1. Resource limit policy at admission time. A Kyverno require-resource-limits policy in Audit mode would have flagged the original 1 GiB limit as a potential issue and made it visible in PolicyReports before OOMKills started.

2. Control-plane taint monitoring. A simple alert on kube_pod_info{node="vm-srv-k3s-11"} unless kube_pod_info{namespace="kube-system"} would have fired the moment a user workload landed on the control-plane.

3. Node capacity validation in Terraform. The balloon minimum should be part of the VM definition review — ideally validated against the minimum kubelet requires to start safely.

None of these are exotic. They're standard practice in production clusters. The lesson is that homelab clusters accumulate the same failure modes as production clusters, just with less monitoring to catch them.

The Fixes, Summarised

Problem	Root Cause	Fix
OOMKill × 16	1 GiB limit too low for Tesseract burst	Limit → 3 GiB, workers → 2
Control-plane load 90	`tolerations: operator: Exists`	Remove blanket toleration
1.2M DNS queries/day	CoreDNS TTL 30s + OOM-induced restart churn	CoreDNS cache → 300s, AdGuard optimistic + 64 MiB
3.8 GiB allocatable	Proxmox balloon min 4096 MB, kubelet reads at startup	`floating = 8192` in Terraform

The cluster has been stable since. Paperless processes the same document batches without issue. CoreDNS query volume is down 90%. And kubelet now correctly reports 16 GiB on both workers.

The same failure modes — resource limits without ceiling analysis, overly permissive scheduling constraints, and hypervisor-level capacity mismatches — appear in enterprise Kubernetes deployments running on Azure VMs or bare-metal. The only difference is scale: one misconfigured limit in a 500-node cluster can trigger the same DNS storm, just with three extra zeros behind the query count.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:37

↗

Expo SDK 56 ships with a custom Kotlin compiler plugin that eliminates reflection from Expo Modules on Android. The result: 70% faster module initialization and a 30% reduction in time to first render. The plugin runs during compilation, so app developers get these...

Expo SDK 56 ships with a custom Kotlin compiler plugin that eliminates reflection from Expo Modules on Android. The result: 70% faster module initialization and a 30% reduction in time to first render.

The plugin runs during compilation, so app developers get these performance gains automatically without changing any code. Module authors can unlock even bigger wins with a single annotation.

This post walks through how we built it and why this approach succeeded where previous attempts failed. For the Swift side where we now talk to JSI directly, check out our companion post Talking to JSI in Swift.

The reflection problem we inherited

Before Expo Modules, we had Unimodules. They worked like old React Native bridge modules: you'd sprinkle annotations across methods you wanted to expose, and the runtime would discover everything through reflection.

class ClipboardModule(context: Context) : ExportedModule(context) {
  override fun getName() = "ExpoClipboard"

  @ExpoMethod
  fun getStringAsync(promise: Promise) {
    val clip = clipboardManager.primaryClip?.getItemAt(0)
    promise.resolve(clip?.text?.toString() ?: "")
  }

  @ExpoMethod
  fun setStringAsync(content: String, promise: Promise) {
    clipboardManager.setPrimaryClip(ClipData.newPlainText(null, content))
    promise.resolve(true)
  }
}

Reflection made sense when we needed metadata about our own code. What methods does this module export? What arguments do they accept? The JVM could answer those questions. But reflection costs time, and on Android that time comes straight out of your startup budget. Every module the runtime introspects adds milliseconds before users see your app.

Building the Expo Modules API gave us a chance to fix this. We wanted better ergonomics and less reflection. The Kotlin DSL delivered both in one move, removing most reflection while making modules easier to write. But we couldn't eliminate all of it. Type information for function arguments and Record properties still required runtime reflection calls like typeOf<T>() and the metadata parsing that comes with them.

Where reflection actually hurts

The remaining cost shows up in two places. First, reconstructing type parameters. Our DSL reads argument and return types through typeOf<T>(), which works because T is reified. The JVM normally erases generics at runtime, so you can't ask what T actually was. Reified type parameters work around this limitation. The compiler inlines the function and substitutes the real type directly. Getting type information this way is usually cheap, but costs add up when modules have many functions or deeply nested generics.

The second cost is heavier: Record conversion. A Record represents a typed JS object on the native side. Converting one means discovering its shape at runtime: which properties it declares, which ones are exposed to JS, and what type each property has.

This discovery process is expensive because it involves multiple layers of reflection. You ask the JVM for the class's memberProperties, then ask each property for its annotations and type, then make the field accessible for writing. Some of that information isn't even directly available in bytecode. The JVM knows about classes and members, but nothing about Kotlin's type system. The Kotlin reflection library has to reconstruct that by parsing the @Metadata annotation, which contains a binary blob the compiler generates.

We could sidestep some of this work. Top-level nullability doesn't need full reflection with a reified T, a simple null is T check answers it. But nested cases like the T in List<T> are different. The JVM erases generics, so type arguments disappear from bytecode at runtime. It has no concept of Kotlin nullability either. The only place that information survives is the @Metadata annotation, and there's no shortcut to reading it. You have to parse that metadata, which is exactly the cost we were trying to avoid.

Why we skipped code generation

The standard solution for this problem is code generation. Both Java and Kotlin have established tools for it. Annotation processors (kapt) and the Kotlin Symbol Processing API (KSP) run at build time and emit source files that pre-compute type metadata, so you never touch reflection at runtime. We also looked at standalone codegen tools that run before compilation, like React Native's TurboModule generator.

We tested this approach and didn't like what we found. Generated code becomes part of your project. It appears in call stacks, you step through it in the debugger, and when something breaks in the JS-to-native bridge, you're reading machine output that's painful to debug. Also, kapt and KSP can only add new files, never modify existing ones. Instead of augmenting a Record class in place, you'd generate a parallel class from scratch. Standalone tools just swap those problems for others: another build step, more toolchain integration, more maintenance overhead.

We were stuck for a while. We lived with the reflection cost and watched for better options.

What K2 changed

Kotlin 2.0 shipped with the new K2 compiler and changed what was possible. The add-only limitation of kapt and KSP is exactly what K2 removes. The new compiler plugin API gives you access to the intermediate representation (IR) the compiler produces. You're editing code as the compiler sees it, before it becomes bytecode. If you produce invalid code, the compiler catches it. You can write tests against the transformed IR. Unlike codegen, the result isn't a parallel layer of code you have to maintain. It's small, surgical changes in well-defined places.

We always knew we could modify bytecode directly, but we didn't want to maintain that. Too fragile, too easy to produce something that only breaks at runtime on specific Android versions. The compiler plugin API gives the same power with actual safety guarantees.

How the plugin works

The plugin's approach is simple: everything reflection discovers at runtime, the compiler already knew at build time. Built on the K2 API, the plugin targets the two most expensive operations we described:

1. Pre-computed type descriptors

When an Expo Module needs type information, it calls typeDescriptorOf<T>(). The function itself is a stub that throws if it ever runs:

fun <T> typeDescriptorOf(): PTypeDescriptor =
  throw NotImplementedError(
    "typeDescriptorOf<T>() should be replaced by the compiler plugin"
  )

It exists so code compiles, but should never execute. During compilation, the plugin intercepts every call to typeDescriptorOf<T>() and replaces it with a direct reference to a pre-computed type descriptor object:

// What you write:
typeDescriptorOf<List<Int>>()

// The equivalent of what the compiler emits:
PTypeDescriptorRegistry.getOrCreateParameterized(
  List::class.java,
  isNullable = false,
  parameters = arrayOf(
    PTypeDescriptorRegistry.getOrCreateConcrete(
      Int::class.java, 
      isNullable = false
    )
  )
)

Think of typeDescriptorOf<T>() as our own, leaner version of typeOf<T>(). Both return objects that describe types, but where typeOf returns a full KType, ours returns a PTypeDescriptor (P for Pika, the plugin's internal codename) that carries only what we actually use: a Class<?> reference, a nullability flag, and a list of parameter descriptors. No dependency on the Kotlin reflection library.

The lean shape also reduces allocation. For simple types like String or Int, the registry returns pre-allocated static fields, so there's no allocation. For parameterized generics, descriptors are cached and deduplicated across modules, so the cost is paid once. In JVM microbenchmarks, building descriptors this way runs roughly 2x faster than typeOf for complex types like Map<String, List<Int?>>.

2. Pre-computed Record metadata

The fix for reflection-heavy conversion is a single annotation. Mark a Record with @OptimizedRecord and the plugin takes over:

@OptimizedRecord
class UserRecord : Record {
  @Field val name: String = ""
  @Field val age: Int = 0
  @Field val address: AddressRecord? = null
}

That annotation is the opt-in. For any class marked with @OptimizedRecord, the plugin does at compile time exactly what SDK 55 did at startup: reads property names, types, and annotations and bakes them into bytecode as plain objects, paired with direct accessors that use simple index-based dispatch. Setting a field goes from "make it accessible via reflection, then set it" to a plain assignment.

If compiled metadata is present, the runtime takes the fast path. If not (annotation was omitted or plugin didn't run), it falls back to the same reflection-based conversion from SDK 55. Either way, your module keeps working.

Records aren't the only beneficiary. Jetpack Compose props marked with @OptimizedComposeProps get the same treatment, applied to prop resolution instead of field conversion. This matters because prop resolution was a major bottleneck for packages like expo-ui that use Android's declarative UI heavily.

Performance results

Performance gains depend on how many Expo Modules your app uses and what types they export. We measured cold starts of a module-heavy test app (all official Expo modules plus popular third-party TurboModules) on two devices: a OnePlus 9 Pro and an older Samsung Galaxy S9.

The results:

Android module initialization: ~70% faster
Time to first render: ~30% improvement
Record conversion: ~6x faster

Raw cold-start numbers from our module-heavy test app (clean mean over 150 iterations, outliers removed):

Metric	SDK 55	SDK 56	Change
Cold launch (Activity.onCreate)	93 ms	55 ms	-41%
Time to first render	797 ms	508 ms	-36%
First animation frame	808 ms	520 ms	-36%

What you need to do

App developers: nothing. The compiler plugin runs automatically in SDK 56. The typeDescriptorOf replacement applies to all types without code changes.

Module maintainers who use Records can opt into faster conversion with @OptimizedRecord:

@OptimizedRecord
class MyConfig : Record {
  @Field val apiUrl: String = ""
  @Field val timeout: Int = 30
}

If you use props with our Compose integration, annotate with @OptimizedComposeProps:

@OptimizedComposeProps
data class MyViewProps(
  val title: MutableState<String> = mutableStateOf(""),
  val count: MutableState<Int> = mutableIntStateOf(0)
) : ComposeProps

Skipping these annotations doesn't break anything. Modules fall back to the same reflection-based conversion from SDK 55. You just miss out on the 6x speedup for Records.

Looking ahead

This isn't the finish line. The compiler plugin currently handles type metadata and Record conversion, but the same approach can extend to other parts of the module lifecycle, like function dispatch. We're also investing in the plugin itself, making it more capable and easier to maintain, so we can expand its scope without expanding maintenance costs. The goal remains the same: keep the ergonomic APIs module authors write today while pushing more work to compile time.

This post is based on content from the Expo blog. Follow @expo for more React Native content.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:37

↗

I was building a feature that needed to say something useful about a stock — not just print its P/E, but actually read the situation: is this cheap or expensive, what's the bull case, is the insider buying real or routine. I went looking for an API. Every finance API I found...

I was building a feature that needed to say something useful about a stock — not just print its P/E, but actually read the situation: is this cheap or expensive, what's the bull case, is the insider buying real or routine. I went looking for an API.

Every finance API I found sold me raw data. Alpha Vantage, Twelve Data, Yahoo Finance, FMP — they'll hand you fundamentals, prices, filings, all of it. Great. Now I get to write the part that turns 40 metrics into "this looks expensive but the moat is widening." That's the part that's actually hard, and the part I didn't want to own forever.

So I'd be wiring three data providers, normalizing their conflicting field names, writing and tuning the LLM prompts, handling the rate limits and the caching, and then maintaining all of it as the upstreams change. For a feature, not a product.

What I wanted instead

A single endpoint. Ticker in, analysis out — already synthesized, already structured.

That's what I ended up building for myself and then put on RapidAPI: Agent Toolbelt — AI Stock Research API. It pulls live fundamentals from Polygon, Finnhub, and Financial Modeling Prep, then returns a Motley-Fool-style read as typed JSON. The numbers are in there too, but the point is the verdict and the reasoning.

Here's a real stock-thesis response:

{
  "verdict": "bullish",
  "oneLiner": "Nvidia owns the essential infrastructure for the AI revolution with a defensible software moat.",
  "keyStrengths": [
    "~80%+ data center GPU market share",
    "CUDA moat creates switching costs",
    "42 buy / 5 hold / 1 sell analyst consensus"
  ],
  "keyRisks": [
    "36.9x P/E leaves no margin for error",
    "Competition from AMD and custom silicon"
  ],
  "insiderRead": "Two executives bought ~47k shares each — meaningful open-market purchases, not routine grants.",
  "dataSnapshot": { "currentPrice": 180.4, "peRatio": 36.9, "marketCapBillions": 4452.2 }
}

That's one HTTP call. No data-provider accounts, no prompt engineering, no normalization layer.

The endpoints

All POST, ticker (or list) in, structured JSON out:

Endpoint	What you get
stock-thesis	Verdict + thesis, strengths, risks, valuation, what to watch
valuation-snapshot	very_cheap → very_expensive verdict, P/E, P/S, EV/EBITDA, FCF yield, ROE, buy-zone price
insider-signal	Form 4 read: real open-market buying vs. routine noise, strong_buy → strong_sell
earnings-analysis	EPS beat/miss history, revenue trend, next earnings date
bear-vs-bull	Steelmanned bull + bear cases, net verdict, the key debate
compare-stocks	Head-to-head on 2–3 tickers, winner + per-ticker breakdown
moat-analysis	Buffett-style moat rating (wide/narrow/none), sources, durability
watchlist-scan	Rank 2–15 tickers by value/quality/growth/income in one call

US-listed equities. Every metric is tagged with its source, so you can see whether a figure is TTM from FMP or normalized from Finnhub.

Calling it

On RapidAPI, auth is handled for you — subscribe, copy the snippet, the X-RapidAPI-Key and host get filled in. The body is the only thing you write:

curl -X POST 'https://<rapidapi-host>/api/tools/stock-thesis' \
  -H 'X-RapidAPI-Key: YOUR_KEY' \
  -H 'X-RapidAPI-Host: <rapidapi-host>' \
  -H 'Content-Type: application/json' \
  -d '{"ticker": "NVDA"}'

There's a free tier to test against before you wire it into anything. Paid plans scale by monthly call volume.

When this is the wrong tool

If you need tick-level price feeds, options chains, or to run your own models on raw fundamentals — buy raw data; this isn't that. This is for when you want the judgment layer (a verdict, a thesis, a ranked watchlist) without building and maintaining it yourself. Output is AI-generated and informational, not investment advice — do your own due diligence.

If that's the layer you were about to build: it's on RapidAPI here. I'd rather you spend the afternoon on your actual product.

Built by Marco Arras. Questions → hello@elephanttortoise.com.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:35

↗

I’ve been a software engineer for more than 12 years. And like many developers, I’ve been watching AI improve at an incredible speed. Every new model seems smarter than the one before it. Tasks that used to take hours can now be done in minutes. Problems that required deep...

I’ve been a software engineer for more than 12 years.

And like many developers, I’ve been watching AI improve at an incredible speed.

Every new model seems smarter than the one before it. Tasks that used to take hours can now be done in minutes. Problems that required deep research can often be solved with a simple prompt.

A few years ago, we used to say:

Think of AI as a junior developer.

That made sense at the time.

But today, I don’t think that’s true anymore.

AI still makes mistakes. Sometimes very obvious ones.

But it also comes up with solutions that surprise me. Sometimes it finds an approach I wouldn’t have thought of immediately. Sometimes it helps me solve a problem much faster than I could on my own.

And honestly, that’s both exciting and a little scary.

But the biggest thing AI changed wasn’t how I write software.

It changed how I think about my work.

For most of my career, I thought I loved writing code.

I spent years doing it. At work, on side projects, and whenever I had free time.

Then AI became part of my daily workflow.

In the last month, I’ve built more projects than I normally would in an entire year.

Ideas that had been sitting in my notes for years suddenly became possible.

And that’s when I realized something important:

I don’t actually love writing code.

I love building things.

I love taking an idea and turning it into something real.

I love creating products, solving problems, and seeing something that only existed in my head become something people can use.

Code was simply the tool I used to do that.

And now AI is another tool.

That’s why I don’t hate it.

In many ways, AI has helped me build more than ever before.

It helped me revisit old ideas that I never had time to work on.

It helped me experiment faster.

It even encouraged me to explore areas outside software development, like animation and content creation.

And this isn’t just happening to programmers.

AI is changing design.

It’s changing writing.

It’s changing marketing.

It’s changing video production.

It’s changing almost every creative and technical field.

What excites me the most is what this means for people who are just getting started.

There are millions of people with great ideas who never studied Computer Science.

They never learned Software Engineering.

They don’t have years of experience.

Before AI, many of those ideas would never leave their notebooks.

Now they have a chance to build them.

Will every product be great?

Of course not.

Will some of them break when real users start using them?

Absolutely.

But for the first time, many people can turn an idea into reality without spending years learning every technical detail first.

That doesn’t mean expertise is no longer important.

AI removed many of the barriers to getting started.

It did not remove the barriers to excellence.

Building something is easier.

Building something truly great still takes skill, experience, and hard work.

The best products will still come from people who understand their craft and continue learning.

So how do I feel about AI?

Honestly, I feel three things at the same time.

I’m excited.

I’m impressed.

And I’m a little worried.

But overall, I’m optimistic.

I think AI will help more people create, experiment, and build the ideas they’ve been carrying around for years.

And that’s a future I’m happy to be part of.

As long as the AI companies don’t eventually make us apply for a visa just to use it.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:22

↗

Introduction In this post, I'll walk you through deploying a production-ready Kubernetes cluster on Red Hat Enterprise Linux 10 using kubeadm. This lab was inspired by Anthony E. Nocentino's excellent Certified Kubernetes Administrator (CKA): Using kubadm to Install a Basic...

Introduction

In this post, I'll walk you through deploying a production-ready Kubernetes cluster on Red Hat Enterprise Linux 10 using kubeadm. This lab was inspired by Anthony E. Nocentino's excellent Certified Kubernetes Administrator (CKA): Using kubadm to Install a Basic Cluster training course, which is part of the official Certified Kubernetes Administrator (CKA) path on Pluralsight.

⭐ Shout-out: Anthony is a fantastic trainer! His course uses Ubuntu 22.04 as the base OS. I adapted his approach to work on RHEL 10, adding some additional considerations specific to Red Hat's ecosystem.

One intentional decision in this setup: I deployed Kubernetes v1.35 and CRI-O v1.35, which wasn't the latest version available at installation time.

This was purposeful. Anthony's course includes a dedicated section on upgrading clusters, and using a slightly older baseline makes that learning path clearer.

The upgrade procedures (not covered here) are what really solidify your understanding of cluster lifecycle management.

Lab Infrastructure Overview

Nodes Configuration

Node	Role	RAM	vCPUs	IP Address
rh-cp1	Control Plane	12 GiB	2	192.168.110.120
rh-node1	Worker	6 GiB	2	192.168.110.121
rh-node2	Worker	6 GiB	2	192.168.110.122
rh-node3	Worker	6 GiB	2	192.168.110.123

Note: The IP address schema is just an example and what was more convenient for me.

Supporting Infrastructure

A dedicated utilities VM (also RHEL 10) provides essential services:

DNS (BIND/named)
NTP (chrony)
HTTP (Apache/httpd)
DHCP (Kea)

This centralized infrastructure simplifies name resolution across all cluster nodes. But this is not essential for this project. You can, instead, ensure the nodes are able to reach each other updating the file /etc/hosts on all nodes.

Prerequisites & OS Preparation

Before diving into Kubernetes, we need consistent node preparation across all machines.

1. System Registration and Updates

$ sudo subscription-manager register --username <username> --password <password>
$ sudo dnf update redhat-release
$ sudo dnf upgrade
$ sudo reboot

2. Disable Swap (Required by Kubernetes)

Edit /etc/fstab to comment out swap entries:

UUID=xxxxxxxx-xxx-xxxx-xxxx-xxxxxxxxxxxx none swap defaults 0 0
# ^ Comment this line out

Verify:

$ sudo swapoff -a
$ free

3. Disable Firewalld

Disable firewalld, as indicated in the Calico System requirements for Kubernetes:

$ sudo systemctl stop firewalld
$ sudo systemctl disable firewalld
$ sudo systemctl mask firewalld

⚠️ Production Note: Use Calico to maintaining security and enforce network policies later.

4. Load Kernel Modules and Enable IP Forwarding

$ cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

$ sudo modprobe overlay
$ sudo modprobe br_netfilter

Configure sysctl parameters:

$ cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF

$ sudo sysctl --system

Verify the modules loaded correctly:

$ lsmod | grep overlay
$ lsmod | grep br_netfilter

Installing Kubernetes Components

Setting Version Variables

$ KUBERNETES_VERSION=v1.35
$ CRIO_VERSION=v1.35

Adding Repositories

Create the Kubernetes repo:

$ cat <<EOF | sudo tee /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://pkgs.k8s.io/core:/stable:/$KUBERNETES_VERSION/rpm/
enabled=1
gpgcheck=1
gpgkey=https://pkgs.k8s.io/core:/stable:/$KUBERNETES_VERSION/rpm/repodata/repomd.xml.key
EOF

Create the CRI-O repo:

$ cat <<EOF | sudo tee /etc/yum.repos.d/cri-o.repo
[cri-o]
name=CRI-O
baseurl=https://download.opensuse.org/repositories/isv:/cri-o:/stable:/$CRIO_VERSION/rpm/
enabled=1
gpgcheck=1
gpgkey=https://download.opensuse.org/repositories/isv:/cri-o:/stable:/$CRIO_VERSION/rpm/repodata/repomd.xml.key
EOF

Installing Packages

$ sudo dnf install -y kubelet kubeadm kubectl cri-o container-selinux

Configuring CRI-O

Create a cgroup manager configuration file:

$ cat <<EOF | sudo tee /etc/crio/crio.conf.d/02-cgroup-manager.conf
[crio.runtime]
conmon_cgroup = "pod"
cgroup_manager = "cgroupfs"
EOF

Enable and start services:

$ sudo systemctl enable --now crio kubelet
$ sudo systemctl restart crio

Version Locking

To prevent accidental upgrades:

$ sudo dnf install 'dnf-command(versionlock)'
$ sudo dnf versionlock add kubeadm-1.35.4 kubelet-1.35.4 kubectl-1.35.4 cri-o-1.35.2

Note: Review the output from the installation of the packages kubeadm, kubelet, kubectl and cri-o, and update the versions to lock in the command above.

Initializing the Control Plane

On rh-cp1, download and configure Calico networking. To know the current lastest version, check Tigera documentation, in the Manifest tab:

$ wget https://raw.githubusercontent.com/projectcalico/calico/v3.32.0/manifests/calico.yaml

Edit the CALICO_IPV4POOL_CIDR value to match your pod network plan:

- name: CALICO_IPV4POOL_CIDR
  value: "10.244.0.0/16"

Initialize the cluster, to use the same subnet:

$ sudo kubeadm init \
  --kubernetes-version v1.35.4 \
  --pod-network-cidr=10.244.0.0/16 \
  --cri-socket unix:///var/run/crio/crio.sock \
  --upload-certs

Once successful, save the join commands that appear at the end of the output—you'll need these for worker nodes!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.110.120:6443 --token xxxxxx.xxxxxxxxxxxxxxxx \
 --discovery-token-ca-cert-hash sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  
$

Your Kubernetes control-plane has initialized successfully!

Configuring kubectl

$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config

Deploy Calico Network

$ kubectl apply -f calico.yaml

Wait a few minutes and verify pods are running:

$ kubectl get pods --all-namespaces
$ kubectl get nodes

Expected output:

$ kubectl get pods --all-namespaces
NAMESPACE     NAME                                       READY   STATUS    RESTARTS   AGE
kube-system   calico-kube-controllers-6b4b6457d5-p98c2   1/1     Running   0          2m47s
kube-system   calico-node-5vjrr                          1/1     Running   0          2m47s
kube-system   coredns-7d764666f9-r4vqp                   1/1     Running   0          4m9s
kube-system   coredns-7d764666f9-vh7df                   1/1     Running   0          4m9s
kube-system   etcd-rh-cp1                                1/1     Running   0          4m28s
kube-system   kube-apiserver-rh-cp1                      1/1     Running   0          4m28s
kube-system   kube-controller-manager-rh-cp1             1/1     Running   0          4m27s
kube-system   kube-proxy-4r6h8                           1/1     Running   0          4m10s
kube-system   kube-scheduler-rh-cp1                      1/1     Running   0          4m28s
$ kubectl get nodes
NAME     STATUS   ROLES           AGE     VERSION
rh-cp1   Ready    control-plane   4m40s   v1.35.4
$

Joining Worker Nodes

On each worker node (rh-node1, rh-node2, rh-node3), run the join command saved during kubeadm init:

$ sudo kubeadm join 192.168.110.120:6443 \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>

Verify the cluster health from the control plane:

$ kubectl get nodes

All nodes should show Ready status.

$ kubectl get nodes
NAME       STATUS   ROLES           AGE     VERSION
rh-cp1     Ready    control-plane   10m     v1.35.4
rh-node1   Ready    <none>          2m11s   v1.35.4
rh-node2   Ready    <none>          113s    v1.35.4
rh-node3   Ready    <none>          102s    v1.35.4
$

Install bash completion for kubectl:

$ sudo dnf install bash-completion
$ echo "source <(kubectl completion bash)" >> ~/.bashrc
$ source ~/.bashrc

Testing the Deployment

Deploy a test application:

$ kubectl create deployment hello-world --image=psk8s.azurecr.io/hello-app:1.0
$ kubectl get pods -o wide

Expose it via a service:

$ kubectl expose deployment hello-world --port=80 --target-port=8080
$ kubectl get service hello-world

For example:

$ kubectl get service hello-world
NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
hello-world   ClusterIP   10.99.168.218   <none>        80/TCP    8s
$

Then, test it. You should see a response from your running container!

$ curl http://10.99.168.218:80
Hello, world!
Version: 1.0.0
hello-world-b5b7f67cc-d26dt
$

Clean up after testing:

$ kubectl delete service hello-world
$ kubectl delete deployment hello-world

Key Considerations When Moving from Ubuntu to RHEL

Here are the main differences I encountered adapting Anthony's tutorial:

Aspect	Ubuntu Approach	RHEL 10 Adaptation
Package Manager	apt/dpkg	dnf/rpm
Firewall Management	ufw/firewalld optional	firewalld disabled (use Calico policies)
Subscription	N/A	subscription-manager required
SELinux	Permissive mode default	Need to handle SELinux context
CRI Runtime	containerd	CRI-O

What's Next?

This setup provides a solid foundation for learning Kubernetes administration. From here, you could explore:

Cluster upgrades (covered extensively in Anthony's course)
Network policy enforcement with Calico
High availability with multiple control plane nodes
Storage classes and persistent volumes
Monitoring stack with Prometheus/Grafana

If you found this walkthrough helpful, I'd highly recommend checking out the original Pluralsight course. Anthony's explanations are crystal clear, and adapting them to different distributions is an excellent way to deepen your understanding of what happens under the hood.

Resources

Pluralsight: Certified Kubernetes Administrator Certification Path
Official Kubernetes Documentation
Calico Project GitHub

Thanks for reading! Feel free to share your own experiences with Kubernetes on RHEL in the comments below.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:21

↗

This week's tooling news clusters around a recurring theme: removing dependencies that were never really necessary. Biome ditches the TypeScript compiler for type-aware linting. Swift developers stop caring which editor they're in. And the most interesting finding of the week...

This week's tooling news clusters around a recurring theme: removing dependencies that were never really necessary. Biome ditches the TypeScript compiler for type-aware linting. Swift developers stop caring which editor they're in. And the most interesting finding of the week is that a 1990s text-retrieval algorithm outperforms GPT-4 at catching lying agents. Here's what's worth your attention.

Swift Extension Lands on Open VSX Registry

The official Swift extension is now published to the Open VSX Registry, which means Cursor, VSCodium, AWS Kiro, and any other LSP-compatible editor that doesn't use the proprietary VS Code Marketplace can now auto-install it without you doing anything. Code completion, debugging, and the test explorer just work.

This matters because the Swift toolchain has always been Xcode-or-fight. Any serious cross-platform Swift work meant manually tracking down extensions, pinning versions, and hoping nothing broke when someone cloned the repo on a different machine. Agentic IDEs that provision their own extensions automatically—like Cursor and Kiro—now get Swift support without intervention.

Verdict: Ship. If you're already in an Open VSX-compatible editor, there's nothing to configure. Zero blocking concerns; this is a pure reduction in setup friction.

Biome v2 Adds Type Inference Without TypeScript

Biome v2 ships its own type inference engine, decoupling type-aware linting rules from the TypeScript compiler entirely. The headline number is 75% detection parity on floating promise rules compared to typescript-eslint—lower recall, but at meaningfully lower install weight and CI overhead. Multi-file analysis also lands in v2, unlocking rules that require cross-module context that were structurally impossible in v1.

The real value proposition isn't feature parity—it's dependency elimination. Pulling TypeScript out of your lint pipeline reduces cold-start times in CI and removes a whole class of version-mismatch bugs between typescript, @typescript-eslint/parser, and tsconfig.json. For teams already using Biome for formatting, this removes the last reason to keep eslint in the chain.

The catch: 75% recall on floating promises is a preliminary benchmark, not a production confidence threshold. You will miss some issues that typescript-eslint catches.

Verdict: Ship for formatting and linting speed gains now. Evaluate type-inference rules—run them in warn-only mode alongside your existing setup until you've validated recall on your codebase. Migrate with biome migrate --write and audit breaking config changes before cutting over.

Durable Object Facets Load Agent Code With Storage

Cloudflare's new Durable Object Facets let you load dynamically generated JavaScript classes into a supervisor isolate, each with its own isolated SQLite storage, request interception, and built-in metering hooks. The API surface is minimal: this.ctx.facets.get() with a dynamic class reference.

The pattern this unlocks is significant. Previously, if you were building a platform where users generate or configure agent code, you had a hard choice: run it in a disposable sandbox with no persistence, or provision real infrastructure with no containment boundary. Facets give you both—persistent storage and isolation—inside a Cloudflare Workers deployment. Logging and metering are interception points on the supervisor, not bolted-on external calls.

Verdict: Ship if you're building any code generation → persistent application platform. This is in open beta and the syntax is straightforward. If you're already on Cloudflare Workers and doing anything with user-generated agent logic, try this immediately.

LLM Judges Fail at Detecting False Agent Success

This is the most operationally important finding of the week. Researchers benchmarked LLM judges against lightweight TF-IDF detectors for catching agents that falsely report task completion. TF-IDF won by 4–8x on recall, at 3,300x lower latency. On tau2-bench the TF-IDF detector hits AUROC 0.83; on AppWorld it reaches 0.95.

Silent agent failures—tasks logged as complete that aren't—are a production monitoring problem, not a research curiosity. If your agent evaluation pipeline uses an LLM to verify completion, you're paying inference costs for worse recall than a statistical classifier you could train in an afternoon. The requirement is baseline labeling on your domain: collect examples of genuine completions and false completions, train a task-specific TF-IDF classifier, deploy it as a monitoring layer.

The intuition for why this works: false completion responses tend to be formulaic. Agents that give up and lie about it produce characteristic token patterns that a calibrated statistical detector catches reliably. LLM judges, by contrast, are susceptible to confident-sounding but wrong assertions.

Verdict: Ship as a monitoring layer now. No latency penalty, higher recall, and domain calibration is achievable with modest labeling investment. Don't replace your full eval suite—add this as a triage layer on completion signals.

Community Trains Reasoning Models on Free Kaggle TPUs

Google's Tunix hackathon published end-to-end recipes for adding chain-of-thought reasoning to small models (Gemma 2B and 3 1B) using SFT, preference optimization, and GRPO—all runnable in roughly 9 hours on free Kaggle TPU quota. Datasets range from 33k to 70k samples; reward functions use either LLM-as-judge or TF-IDF scoring.

The practical unlock here is domain-specific reasoning without frontier model dependency. Medical, legal, chemistry, and robotics reasoning tasks have structured correctness criteria that make reward function design tractable. If you have labeled domain data and a clear definition of a correct reasoning chain, you can now post-train a 1–2B model to reason in your domain for free.

The techniques are battle-tested—winners' code and Colab tutorials are published.

Verdict: Evaluate. If you have a domain reasoning problem and labeled data, run the published Colab now. If you're waiting for GPT-5 to solve domain-specific reasoning for you, this is the alternative worth understanding.

Tigris Adds Bucket Location Types for Compliance

Tigris now lets you specify data residency at bucket creation time: global, multi-region, dual-region, or single-region. Multi-region buckets are priced at $0.025/GB/month with zero egress fees. The eur location flag pins data to European infrastructure for GDPR compliance without custom replication logic.

This is a straightforward replacement for hand-wired S3 cross-region replication patterns. The pricing model—no egress fees, flat per-GB—makes cost predictable in ways that AWS S3 data transfer billing is not. Existing buckets can migrate through the dashboard Settings panel; new buckets get configured at creation with tigris mk my-bucket --locations eur or equivalent API call.

Verdict: Ship if you have data sovereignty requirements. Evaluate if you're currently managing cross-region replication manually and want to simplify the operational surface. No meaningful adoption risk.

If any of these landed on something you're actively building, Dev Signal covers this kind of analysis every issue—no hype, just the tooling changes that actually affect how you ship. Subscribe and get it directly in your inbox.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:21

↗

When you talk to ChatGPT about a subject you understand well, you quickly notice something. The first answer is rarely the final answer. You add context. You correct an assumption. You explain what has already been tried. You point out that one proposed solution conflicts...

When you talk to ChatGPT about a subject you understand well, you quickly notice something.

The first answer is rarely the final answer.

You add context.

You correct an assumption.

You explain what has already been tried.

You point out that one proposed solution conflicts with another part of the system.

After a few iterations, the answer becomes useful.

The same thing happens when AI writes code for real products.

The difference is that a slightly incorrect explanation in a chat is usually harmless. Slightly incorrect code can become part of your product, pass a superficial review, and remain there for years.

This is why successful AI adoption in software engineering is not primarily about generating more code.

It is about context engineering: giving AI enough context, constraints, and feedback to generate code that belongs in your system.

The First Answer Is Usually Not Enough

AI coding tools are very good at producing plausible solutions.

That word matters: plausible.

The code may compile.

The tests may pass.

The implementation may even look clean when reviewed in isolation.

But software does not exist in isolation.

A change must fit the broader system architecture:

the current architecture
existing domain rules
security requirements
operational constraints
established conventions
previous technical decisions
future product direction

An AI assistant does not automatically understand those things.

It knows the code it can see and the engineering context you provide. Everything outside that window must be inferred.

And inference is where divergence begins.

If you trust the first response without validating its assumptions, you are usually not accelerating engineering.

You are accelerating uncertainty.

Lack of Context Creates Duplication

One of the first visible effects is duplication.

AI does not necessarily know that your application already has:

a validation helper for the same domain rule
an established authorization pattern
a shared API client
a retry mechanism
an error hierarchy
an existing abstraction for the same operation

Without sufficient engineering context, it creates another one.

The new implementation may be technically correct. It may also be slightly different from the existing implementation.

Now the system has two ways to solve the same problem.

Repeat this across dozens of AI-generated changes and the codebase slowly breaks a basic rule engineers call DRY (don't repeat yourself). The same problem should be solved once, in one place, not rebuilt slightly differently every time. Not because anyone deliberately chose duplication, but because each generated change was created from a partial view of the system.

This is one of the biggest risks of optimizing purely for development speed.

The individual pull request looks faster.

The product becomes slower to change.

Speed Is Not the Same as Longevity

Many companies currently measure AI adoption by how much faster engineers can produce code.

That is understandable.

Generated code is visible. Long-term maintainability is not.

You can measure how quickly a feature reached production. It is much harder to measure how much architectural inconsistency was introduced along the way.

The cost appears later:

similar features behave differently
abstractions begin to overlap
documentation stops matching implementation
developers no longer know which pattern is preferred
changes require more investigation
AI receives increasingly contradictory context

The system becomes harder for humans to understand.

It also becomes harder for AI to understand.

This creates a feedback loop.

Weak context produces inconsistent AI-generated code. Inconsistent code produces weaker context for the next generation.

From Artisanal Furniture to IKEA

Coding has always contained an element of craft.

Experienced engineers develop preferences around naming, abstractions, interfaces, failure modes, and the shape of a maintainable system.

In that sense, software development has often resembled building artisanal furniture.

Each piece receives attention. Decisions are made by someone who understands the material, the environment, and the intended use.

AI changes the economics.

We are moving from artisanal furniture towards IKEA.

Software components can be produced faster, in greater volume, and by more people. That is not necessarily bad.

Most companies do not need every internal tool to be a handcrafted masterpiece.

Mass production works when the pieces are standardized and the person assembling them has reliable instructions.

Without the instructions, you do not get scalable production.

You get a room full of boards, screws, and several pieces that look almost identical but are not interchangeable.

Who Writes the Instructions?

In software organizations, those instructions are created by experienced engineers with product knowledge.

Senior and staff engineers understand more than the syntax of the codebase.

They know:

why a particular architectural decision was made
which shortcuts are temporary
where regulatory or security constraints exist
which parts of the product are likely to change
which abstractions have already failed
what must remain consistent across teams

That knowledge needs to become accessible to both people and machines.

This includes:

Architecture Decision Records
engineering standards
domain documentation
code examples
security policies
testing requirements
dependency rules
service ownership
operational runbooks

The goal is not documentation for its own sake.

The goal is to turn organizational knowledge into usable engineering context.

A codebase without recorded decisions already creates problems for junior developers.

AI amplifies the same problem.

When there are no written ADRs, your junior AI engineer will happily make a different architectural decision in every pull request.

Old Knowledge Is Dangerous Context

Creating a knowledge base is not enough.

It must remain current.

Outdated documentation can be worse than missing documentation because it gives the model confidence in the wrong constraints.

Imagine that your architecture document still recommends a pattern the team abandoned six months ago.

A human engineer may notice the discrepancy through conversations, recent pull requests, or experience with the system.

AI may treat the document as authoritative.

It then generates code that follows an obsolete direction, while appearing perfectly aligned with the company's written standards.

This means engineering knowledge must be treated like code:

versioned
reviewed
owned
tested where possible
removed when no longer valid

AI can help here too.

It can compare documentation with current implementation, detect references to removed services, identify conflicting instructions, and highlight architectural patterns that are no longer followed.

But the final responsibility still belongs to the organization.

Automating knowledge checks does not remove ownership.

It makes ownership easier to exercise.

Guardrails Matter More Than Prompts

Good prompt engineering helps.

Engineering guardrails matter more.

AI-generated code should move through the same engineering system as human-generated code, with stronger automation where appropriate.

That means proper CI/CD controls:

static analysis
automated tests
type checking
dependency scanning
security scanning
architectural boundary checks
migration validation
deployment safeguards

Where rules can be expressed automatically, they should not depend on someone remembering them during every conversation with an AI assistant.

A prompt can say:

Do not access the database directly from this layer.

An architectural test can enforce it.

The second one scales better.

The most successful AI-enabled teams will not be those with the longest prompt files. They will be those that convert important engineering decisions into automatically verifiable constraints.

Human-in-the-Loop Review

Not every change requires the same level of human attention.

Generating a small internal tool is different from changing payment processing, authorization, infrastructure, or a financial ledger.

Organizations need to decide where a human must remain directly involved.

Critical areas may include:

security boundaries
financial calculations
data migrations
compliance logic
production infrastructure
irreversible operations
public APIs
architectural changes

Human review is not there because AI is useless.

It is there because responsibility cannot be delegated to a probabilistic system. That is the core of human-in-the-loop validation.

The important question is not whether AI participated in the change.

The important question is whether the organization applied the right level of control to the risk involved.

Code Review Is Changing

Code review remains crucial.

But reading every generated line manually is becoming less realistic as the volume of AI-generated code increases.

The answer is not to stop reviewing.

The answer is to change how review works.

Engineers increasingly guide AI to examine changes from multiple perspectives:

Does this duplicate an existing implementation?
Does it violate an architectural decision?
Are authorization checks consistent with similar endpoints?
What edge cases are missing?
Can this migration be safely rolled back?
Which assumptions does this code make?
Does the documentation still match the implementation?

The reviewer moves from inspecting every character to directing a structured investigation.

AI generates code.

AI can also criticize it, compare it with the rest of the system, and search for inconsistencies.

The human decides which questions matter and whether the answers are sufficient.

That is still engineering judgment.

Where Juniors Fit

Junior engineers are still welcome.

AI does not remove the need for people entering the profession. It changes the environment in which they learn.

Previously, juniors often developed their skills by writing relatively simple implementations and receiving feedback from more experienced engineers.

AI can now produce many of those implementations faster.

But generating the implementation was never the final objective.

The objective was learning how to reason about a system.

Juniors still need to understand:

how to validate assumptions
how to investigate existing code
how to recognize duplication
how to test behavior
how to evaluate trade-offs
how to ask better questions
when to escalate a decision

Companies should not optimize juniors purely for prompt throughput.

They should optimize for how quickly people build reliable mental models of the product.

That requires access to knowledge, mentorship, and clear engineering standards.

The same context engineering that makes AI better also makes junior engineers better.

AI Naturally Favors What Already Exists

There is another longer-term concern.

AI models are trained on existing knowledge.

They naturally lean towards technologies, libraries, and architectural patterns that appear frequently in their training data and online documentation.

That gives established technologies an additional advantage.

A mature framework has:

years of documentation
thousands of public examples
answered questions
tutorials
production case studies
code available in public repositories

A new framework has very little of this.

Even when the new framework offers a better approach, AI may produce weaker results because it has less knowledge about how to use it correctly.

This may create an unexpected effect.

AI could accelerate development inside established ecosystems while slowing the adoption of genuinely new ones.

Teams may avoid newer technologies not because they are worse, but because their AI tools are less effective with them.

The internet was already influenced by popularity.

AI may reinforce that influence by turning popularity into implementation quality.

New frameworks will need to treat machine-readable documentation, examples, migration guides, and strong tooling as core parts of adoption.

It will no longer be enough for a technology to be good.

AI must also be able to understand it.

There Is Still a Place for Craft

Not all software should become mass-produced furniture.

There will remain areas where code itself carries unusual value:

specialized algorithms
performance-critical systems
novel research
low-level infrastructure
proprietary domain knowledge
safety-critical software
new technical paradigms

These are places where the available training data may be limited and where correctness depends on knowledge that cannot be reconstructed from common public patterns.

There will also always be engineers who treat code as a form of art.

There is a place for that.

But most commercial software is not valuable because every function is uniquely beautiful.

It is valuable because it solves a real problem reliably and can continue evolving.

AI can help us build more of it.

The condition is that we stop treating generated code as the finished product.

The Real AI Adoption Challenge

The difficult part of practical AI adoption is not selecting a coding assistant.

It is preparing the organization around it.

Do engineers have access to reliable product context?

Are architectural decisions documented?

Is outdated knowledge removed?

Can important rules be checked automatically?

Does CI/CD prevent unsafe shortcuts?

Are humans involved where the consequences justify it?

AI makes producing code cheaper.

It does not make understanding the product less important.

In fact, as implementation becomes easier, context engineering becomes the scarce resource.

The companies that benefit most from AI will not necessarily be those that generate the most code.

They will be those that provide the clearest instructions for how all the pieces should fit together.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:21

↗

This week's releases split neatly into two categories: useful incremental hardening (uv, GitLab, Copilot) and things that should change how you architect systems today (Spring CVEs, pg_durable, and a Cornell paper that quietly invalidates a lot of RAG assumptions). The Spring...

This week's releases split neatly into two categories: useful incremental hardening (uv, GitLab, Copilot) and things that should change how you architect systems today (Spring CVEs, pg_durable, and a Cornell paper that quietly invalidates a lot of RAG assumptions). The Spring security cluster alone is enough to justify a dependency audit before the weekend.

uv 0.11.19 adds CPython 3.15 beta support

uv now always computes SHA256 checksums for remote distributions—previously this was situational—and adds PyEmscripten platform support per PEP 783, which formalizes Python packaging for browser and WASM targets. CPython 3.15.0b2 is available as a managed runtime, and a cross-platform installation edge case on Windows hosts has been resolved.

The SHA256 change is the one worth noting for security posture. Making verification unconditional rather than optional closes a gap where distribution integrity could go unchecked depending on resolver path. The PyEmscripten addition matters if you're packaging Python for browser runtimes—previously you were working around the absence of a formal platform tag; now you're not.

Verdict: Ship. Drop-in upgrade, no breaking changes. If you manage Python distributions or target WASM, update now. Everyone else should still update—supply-chain hardening by default is worth the two minutes.

GitLab 19.0 adds group-level review instructions, secrets manager

GitLab 19.0 ships two meaningful additions for teams: group-level custom review instructions for Duo code review, configured via .gitlab/duo/mr-review-instructions.yaml with cascading inheritance across projects, and a Secrets Manager that exits closed beta for Premium and Ultimate tiers.

Group-level review instructions solve a real annoyance—if you've been maintaining per-project AI review configuration across a monorepo organization, you can now centralize that and let projects inherit or override. It's the kind of change that sounds minor until you've had to sync a guideline update across fifteen repos manually.

The Secrets Manager is more interesting longer-term: native secrets storage reduces operational dependency on HashiCorp Vault or AWS Secrets Manager instances, but it's still in open beta and GitLab's own documentation flags it as not production-ready under strict policy requirements.

Verdict: Ship group-level review instructions now—it's live and the migration path is straightforward. Wait on Secrets Manager until it hits stable release, or evaluate it in a non-production environment if you want early familiarity.

Spring ecosystem ships AI 2.0, patches security flaws

Spring AI 2.0 GA is out, but the more urgent story is the CVE cluster shipping alongside it. Spring HATEOAS, Spring Kafka, Spring LDAP, Spring Security, Spring AMQP, and Spring Vault all carry patches for deserialization vulnerabilities and authentication bypasses. These aren't theoretical—deserialization and auth bypass CVEs in widely deployed frameworks have a short window between disclosure and exploitation.

On the AI side, Spring AI 2.0 deprecates older Gemini model enums. If you're referencing GEMINI_2_0_FLASH or GEMINI_2_0_FLASH_LIGHT in existing code, those break—migration target is GEMINI_3_1_PRO_PREVIEW. Spring Data 2026.0.0 adds type-safe property paths and Kotlin 2.3.20 support, and Spring Vault introduces VaultClient and ReactiveVaultClient abstractions for path handling.

Verdict: Ship the CVE patches immediately—Spring Boot, Security, AMQP, Kafka, and Vault updates are not optional. Evaluate Spring AI if you're on older Gemini integrations; the enum migration is a breaking change but the path is clear. Wait on Vault's new path abstractions until you've validated them in staging.

PostgreSQL extension eliminates external workflow orchestration

pg_durable is a Rust-based PostgreSQL background worker that lets you define fault-tolerant, long-running workflows as native SQL functions. It handles checkpointing, retry logic, and crash recovery internally, using a custom DSL with ~> and |=> operators to express workflow steps.

The pitch is direct: if your stack is already Postgres-centric and you're running Temporal, an external job scheduler, or an async task queue primarily to get durable execution semantics, this replaces that infrastructure. Workflow state lives in Postgres, execution resumes from checkpoints after crashes, and you're not managing a separate service boundary. For vector pipelines and scheduled maintenance tasks in particular, the operational simplification is real.

The caveats are real too. It's an early-stage extension, there's a DSL to learn, and running a Rust-based background worker in your Postgres instance is a different operational profile than a sidecar service.

Verdict: Evaluate for greenfield Postgres-native workloads or internal tooling where you control the environment. Wait for production-critical workflows until the extension has more operational history behind it.

13-word Reddit snippets poison AI search results

Cornell researchers published a straightforward attack: single user-generated comments with high lexical similarity to a target query reliably manipulate LLM outputs and citations when those sources are included in retrieval. The attack works on Reddit, Wikipedia, and similar UGC platforms—trivially placeable content that doesn't require infrastructure access.

For developers building RAG systems or integrating deep research agents that pull from public web sources, this is an architectural concern, not just an academic finding. If your retrieval pipeline sources from UGC platforms and surfaces citations to users, you're currently importing adversarially poisoned content at scale with no detection layer. The reliability contract that makes cited sources meaningful breaks under this attack.

Mitigation requires validation of cited content against author and domain reputation signals, deduplication of suspiciously similar claims across sources, and lexical anomaly detection for query-aligned text. None of those are trivial to build correctly.

Verdict: Evaluate your retrieval pipeline now if you cite Reddit or Wikipedia in agent outputs. This isn't production-ready to ignore—it's a known exploit against a pattern many teams have already shipped. Build poison detection before expanding UGC source coverage.

Copilot routes tasks to right model automatically

GitHub Copilot's Auto selection mode now routes requests by task intent and real-time model health using HyDRA routing. The reported outcome is 72.5% cost reduction while maintaining output quality, achieved by matching task complexity to model capability rather than defaulting every request to the most capable available model. Prompt caching and deferred tool loading extend context budget efficiency in long agentic sessions.

For individual developers, the practical change is removing the cognitive overhead of model selection during extended sessions. For teams on Free or Student plans, Auto is becoming the default—the manual picker is consolidating away for those tiers anyway.

Verdict: Ship—it's already the default in VS Code, github.com, and mobile. No developer action required. The cache-aware routing is specifically designed to avoid mid-session quality degradation, which was the main failure mode of earlier automatic selection attempts.

If these weekly breakdowns save you time triaging what's actually worth acting on, Dev Signal lands in your inbox every issue with the same format. Subscribe at thedevsignal.com—senior engineers only, no filler.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:20

↗

AI-disclosure: AI-assisted draft, human-reviewed. The demo numbers are the verbatim stdout of a deterministic, stdlib-only Python script included in full below — re-run it and you get the same bytes. The attempt counts in that script are a SYNTHETIC fixture I chose to...

AI-disclosure: AI-assisted draft, human-reviewed. The demo numbers are the verbatim stdout of a deterministic, stdlib-only Python script included in full below — re-run it and you get the same bytes. The attempt counts in that script are a SYNTHETIC fixture I chose to exercise the accounting mechanism, calibrated to the retry skew I see in my own scraper logs (run counts from my Apify history). It is NOT a benchmark of any named vendor's API or prices. The one external claim (the cost-per-successful-task formula) is attributed and linked.

The cheaper API was 2.5x cheaper per call. The monthly bill came in higher anyway.

Not by a rounding error. The "cheap" option cost 1.63x more per successful task than the one with the bigger sticker price. Same workload. The price page never showed me that number, because the price page doesn't know your success rate. You do — after you've already paid.

This is the arithmetic the per-call price hides. And it's a decision you make before you spend, not a cap you bolt on after.

TL;DR

You compare two API tiers by per-call (or per-token) price and pick the cheaper one. That ranking can be wrong.
You pay for every attempt — success or fail. The denominator that pays the bills is successful tasks, not calls.
True cost = price_per_attempt × attempts ÷ successes. A cheap tier with a low success rate burns its discount on retries.
In the run below: cheap tier $0.0020/call but $0.0096/success; robust tier $0.0050/call but $0.0059/success. The sticker winner loses.
For anyone choosing an API, model, or tier for an agent: log attempts and successes for a week, divide, then decide. A 70-line script is at the bottom — drop in your numbers.

The price page is a sticker, not a bill

Here's the trap, stated plainly. The number on the pricing page is per call. The number on your invoice is per call too — but the value you got is per successful task. Those are different denominators, and the gap between them is exactly the work that failed.

Every attempt is billed. The one that timed out and got retried: billed. The one that came back malformed and you re-prompted: billed. The one that succeeded on the fourth try: billed four times. If a tier fails 35% of its tasks and burns three to six attempts chasing each hard one, you are paying for a lot of calls that produced nothing you can use.

So the real question isn't "which tier is cheaper per call." It's "which tier is cheaper per task I actually completed." Those can point at different tiers. When they do, ranking by the sticker picks the loser.

The formula is small enough to fit in a sentence:

cost per successful task = cost per attempt ÷ success rate

Codebridge put it the same way in a February 2026 write-up titled, literally, Real Cost per Successful Task: "a model that costs $0.01 per attempt but succeeds only 50% of the time effectively costs $0.02 per success," and "the gap between attempted tasks and completed outcomes contains the bulk of real-world cost." (codebridge.tech) Same mechanism. My contribution here isn't the formula — it's showing the ranking flip with a number you can reproduce, and where the realistic retry shape comes from.

Measuring the flip

I wrote a small script to make the flip concrete. It's deterministic — stdlib only, no network, no random, no clock. Two tiers, 40 tasks each. For every task it records how many billed attempts it took and whether it ultimately succeeded. Then it computes three numbers per tier: per call, per task (spend spread over all tasks), and per successful task.

One honest caveat up front, because it matters: the attempt counts are a synthetic fixture I wrote by hand — numbers I chose to exercise the mechanism. They are not a measurement of any named vendor. What makes them realistic rather than arbitrary is that I shaped the skew to mirror what I see in my own scraper production logs across 2,190 lifetime runs: the cheap, flaky source eats far more retries per success than the stable one. The mechanism is real. The specific cells are illustrative. Swap in your own and the script does the same arithmetic.

#!/usr/bin/env python3
# cost_per_successful_task.py
# Deterministic, stdlib-only, no network. Fixture is inlined below.
#
# Question this answers:
#   You pick the option with the cheaper per-call price. Is it actually cheaper
#   PER SUCCESSFUL TASK once you pay for the failed attempts and retries?
#
# Mechanism (the whole point):
#   true cost-per-success = (price_per_attempt * attempts_spent) / successes
#   A cheap-per-attempt option with a low success rate makes you pay for the
#   wasted attempts on every retry. The headline price lies. The denominator
#   that matters is *successful tasks*, not calls.
#
# This is NOT an LLM benchmark. It is a stdlib simulation of the accounting
# mechanism. The attempt counts are a fixed, hand-written fixture (no RNG),
# chosen to mirror the retry skew we see in our own scraper production logs
# (2,190 lifetime runs): the "cheap" tier eats far more retries per success.

PRICE = {
    # price charged PER ATTEMPT (every attempt is billed, success or fail)
    "cheap_tier":  0.0020,   # looks 2.5x cheaper per call
    "robust_tier": 0.0050,
}

# Fixture: for each task we record how many BILLED attempts it took, and
# whether it ultimately SUCCEEDED. Deterministic, written out by hand so the
# run is fully reproducible. 40 tasks per tier.
#   - cheap_tier: low success rate, heavy retrying (mirrors our flaky-source logs:
#     the cheap option fails ~40% of tasks and burns 3-6 billed attempts chasing
#     each one before giving up or limping to a success)
#   - robust_tier: high success rate, almost always first-try
#
# Each entry = (attempts_billed, succeeded)
TASKS = {
    "cheap_tier": [
        (6, False), (1, True), (5, False), (2, True), (1, True),
        (6, False), (5, False), (2, True), (1, True), (4, True),
        (6, False), (3, True), (1, True), (2, True), (5, False),
        (1, True), (6, False), (2, True), (5, False), (1, True),
        (2, True), (3, True), (6, False), (1, True), (2, True),
        (6, False), (1, True), (4, True), (5, False), (1, True),
        (6, False), (2, True), (1, True), (5, False), (2, True),
        (1, True), (3, True), (6, False), (2, True), (1, True),
    ],
    "robust_tier": [
        (1, True), (1, True), (1, True), (2, True), (1, True),
        (1, True), (1, True), (1, True), (2, True), (1, True),
        (1, True), (1, False), (1, True), (1, True), (1, True),
        (2, True), (1, True), (1, True), (1, True), (1, True),
        (1, True), (1, True), (2, True), (1, True), (1, True),
        (1, True), (1, True), (1, True), (1, True), (2, True),
        (1, True), (1, True), (1, True), (1, True), (1, True),
        (1, True), (2, True), (1, True), (1, True), (1, True),
    ],
}


def summarize(tier):
    rows = TASKS[tier]
    price = PRICE[tier]
    n_tasks = len(rows)
    attempts = sum(a for a, _ in rows)
    successes = sum(1 for _, ok in rows if ok)
    spend = attempts * price
    success_rate = successes / n_tasks
    naive_per_call = price                      # the sticker price you compare
    naive_per_task = spend / n_tasks            # spend spread over ALL tasks
    true_per_success = spend / successes        # the number that pays the bills
    return {
        "tier": tier,
        "n_tasks": n_tasks,
        "attempts": attempts,
        "successes": successes,
        "success_rate": success_rate,
        "spend": spend,
        "naive_per_call": naive_per_call,
        "naive_per_task": naive_per_task,
        "true_per_success": true_per_success,
    }


def main():
    cheap = summarize("cheap_tier")
    robust = summarize("robust_tier")

    print("=" * 64)
    print("COST PER SUCCESSFUL TASK — sticker price vs the real bill")
    print("(stdlib simulation of the accounting mechanism; not an LLM bench)")
    print("=" * 64)
    print()
    hdr = "{:<12} {:>8} {:>9} {:>9} {:>12} {:>16}"
    print(hdr.format("tier", "per-call", "tasks", "success%", "per-task", "per-SUCCESS-task"))
    for r in (cheap, robust):
        print(hdr.format(
            r["tier"].replace("_tier", ""),
            f"${r['naive_per_call']:.4f}",
            r["n_tasks"],
            f"{r['success_rate']*100:.0f}%",
            f"${r['naive_per_task']:.4f}",
            f"${r['true_per_success']:.4f}",
        ))
    print()

    # Who wins on the sticker (per-call) price?
    sticker_winner = min((cheap, robust), key=lambda r: r["naive_per_call"])
    # Who wins on the number that actually pays the bills?
    real_winner = min((cheap, robust), key=lambda r: r["true_per_success"])

    ratio = cheap["true_per_success"] / robust["true_per_success"]

    print(f"Sticker price says cheapest: {sticker_winner['tier'].replace('_tier','')} "
          f"(${sticker_winner['naive_per_call']:.4f}/call)")
    print(f"Cost-per-SUCCESS says cheapest: {real_winner['tier'].replace('_tier','')} "
          f"(${real_winner['true_per_success']:.4f}/success)")
    print()
    print(f"The 'cheap' tier is {robust['naive_per_call']/cheap['naive_per_call']:.1f}x "
          f"cheaper per call,")
    print(f"but {ratio:.2f}x MORE EXPENSIVE per successful task.")
    print()
    print(f"Why: cheap tier burned {cheap['attempts']} attempts for "
          f"{cheap['successes']} successes "
          f"({cheap['attempts']/cheap['successes']:.2f} attempts/success);")
    print(f"     robust tier burned {robust['attempts']} attempts for "
          f"{robust['successes']} successes "
          f"({robust['attempts']/robust['successes']:.2f} attempts/success).")
    print()
    print("VERDICT: the per-call price flipped the winner. The decision is made")
    print("BEFORE you spend — on cost-per-success, not on the sticker.")

    # ---- asserts: lock the invariants that make the article true ----
    # 1) cheap really is cheaper per call
    assert cheap["naive_per_call"] < robust["naive_per_call"]
    # 2) ...but the winner FLIPS on cost-per-success
    assert cheap["true_per_success"] > robust["true_per_success"]
    assert sticker_winner["tier"] == "cheap_tier"
    assert real_winner["tier"] == "robust_tier"
    # 3) the flip is material (cheap is >1.5x worse per success)
    assert ratio > 1.5
    print()
    print("All asserts passed.")


if __name__ == "__main__":
    main()

Run it with python3 -I cost_per_successful_task.py. Here is the exact output:

================================================================
COST PER SUCCESSFUL TASK — sticker price vs the real bill
(stdlib simulation of the accounting mechanism; not an LLM bench)
================================================================

tier         per-call     tasks  success%     per-task per-SUCCESS-task
cheap         $0.0020        40       65%      $0.0063          $0.0096
robust        $0.0050        40       98%      $0.0057          $0.0059

Sticker price says cheapest: cheap ($0.0020/call)
Cost-per-SUCCESS says cheapest: robust ($0.0059/success)

The 'cheap' tier is 2.5x cheaper per call,
but 1.63x MORE EXPENSIVE per successful task.

Why: cheap tier burned 125 attempts for 26 successes (4.81 attempts/success);
     robust tier burned 46 attempts for 39 successes (1.18 attempts/success).

VERDICT: the per-call price flipped the winner. The decision is made
BEFORE you spend — on cost-per-success, not on the sticker.

All asserts passed.

Read the table once. Per call, cheap is $0.0020 and robust is $0.0050 — exactly the 2.5x discount the sticker promises. Per successful task, cheap is $0.0096 and robust is $0.0059. The ranking flips. The discount didn't disappear; it got spent on the 14 tasks that never succeeded and the retries chasing them.

Why it flips: count attempts per success, not calls per dollar

The line that explains everything is the last one in the output: cheap burned 125 attempts for 26 successes — 4.81 attempts per success. Robust burned 46 attempts for 39 successes — 1.18 attempts per success.

That's a 4x difference in how many billed calls it takes to get one usable result. A 2.5x price discount cannot survive a 4x attempt penalty. The math isn't close. 0.0020 × 4.81 ≈ 0.0096. 0.0050 × 1.18 ≈ 0.0059. The cheap tier is cheaper at the unit you don't ship and more expensive at the unit you do.

Notice the middle column too — per-task, spreading spend over all 40 tasks, the two tiers look almost tied: $0.0063 vs $0.0057. That column is a trap of its own. It counts the failed tasks in the denominator as if they were worth something. They weren't. Divide only by what succeeded and the real gap shows up.

"But my success rates are basically the same"

You might be thinking: my two options aren't 65% vs 98%, they're more like 92% vs 95%, so this doesn't apply to me. Maybe. That's exactly the point, though — you don't know until you count, and you can't eyeball a 4x attempt ratio from a pricing table.

A small gap in success rate matters more than it looks when one tier also retries harder. Two things compound: the fraction that never succeeds (pure waste) and the attempts-per-success on the ones that do. A tier can have a "fine" 90% success rate and still burn three attempts on every hard task, and that second factor never shows up as a failure in your dashboard — it shows up as a bigger bill. So don't guess the gap. Log it.

Here's the honest limit of my own claim, since I'm asking you to log yours: I haven't run this exact A/B across two named LLM APIs in production. The retry skew is real and comes from my scraper logs, where flaky sources have always cost multiples more per usable record than stable ones. The two-tier flip in the script is a clean illustration of that pattern, not a vendor benchmark. If you run it for real and the gap is small — great, you just bought certainty for the price of one week of logging.

What to do Monday

The change is procedural, not technical, and it happens before you commit to a tier — not as a spending cap you add after the bill scares you.

Log two counters per option: billed attempts, and successful tasks. Most SDKs already surface attempt/retry counts; if not, increment a counter in your retry wrapper.
Run a week on each tier (or a fair sample of the same workload through both). You need real tasks, not a synthetic ping — your hard cases are where the cheap tier bleeds.
Rank by total_spend ÷ successes, not by the price page. That single division is the whole decision.
Then choose. If the cheap tier still wins on cost-per-success, take it with confidence. If it flips, you just avoided paying 1.6x more for the privilege of a smaller sticker.

This is upstream of every budget guardrail. A spending cap stops you after you've chosen wrong and started bleeding. Choosing on cost-per-success means there's less to cap, because you picked the tier that wastes fewer attempts in the first place. (If you do want the downstream guardrail too, the HTTP 402 budget piece is the other half of this — that one's about capping spend during a run; this one's about which option you pick before the run.)

The price page sells you a per-call number because it's the number that makes them look cheapest. The number that pays your invoice is per successful task. Compute the second one yourself, with your own logs, before you switch.

What's the widest gap you've seen between the sticker price and the real cost-per-success once you counted the retries? I'm collecting the worst flips — drop yours in the comments. 👇

Follow for the next batch of cost-per-success numbers from production. I read every comment.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:20

↗

This week's AI tooling news splits cleanly between infrastructure you can ship today and capability bets that require more careful evaluation. Anthropic dropped two significant releases—Fable 5 and Managed Agents updates—while the Workflow SDK landed a cancellation primitive...

This week's AI tooling news splits cleanly between infrastructure you can ship today and capability bets that require more careful evaluation. Anthropic dropped two significant releases—Fable 5 and Managed Agents updates—while the Workflow SDK landed a cancellation primitive that eliminates entire categories of homegrown plumbing. Underneath all of it, a sharp incident review from Anthropic is the most practically useful thing published this week if you're running multi-turn agents in production.

Workflow SDK adds AbortController cancellation support

The Workflow SDK now threads AbortSignal through workflow steps, using the same web-standard API you already use with fetch. Pass an AbortSignal into your workflow, inspect it inside steps, and you get cooperative cancellation that survives durable suspension and replay.

This matters because cancellation in long-running workflows has historically required custom infrastructure—timeout flags passed through context, manual cleanup hooks, bespoke race logic. That's not interesting code to write or maintain. With AbortController support, you get timeout steps, request racing, and parallel work cancellation with patterns your team already knows.

Two important caveats: this requires workflow@beta, and cancellation is cooperative. The runtime won't forcibly terminate a step—your step code needs to inspect the signal and respond. If you have steps with opaque third-party calls that don't accept signals, you're still writing wrapper logic.

Verdict: Ship. If you're on Workflow SDK 5 and running long-horizon workflows with timeout or race requirements, upgrade and wire this in now. The pattern is standard, the boilerplate reduction is real, and there's no meaningful downside if your steps are already structured around explicit control flow.

Anthropic adds dreaming, outcomes to Managed Agents

Two distinct additions here. Outcomes let you define explicit success criteria enforced by a separate grader agent—replacing manual prompt tuning with a structured feedback loop. Dreaming adds scheduled memory review processes where agents extract patterns from past work, effectively giving long-running agents a form of structured introspection.

The outcomes feature is the immediately useful one. If you've been hand-tuning prompts to steer agent behavior toward task success, externalizing that into a grader agent with explicit criteria is a cleaner architecture. Anthropic reports a 10-point task success lift in internal testing, which is large enough to take seriously even with the usual caveats about benchmark conditions.

Multi-agent orchestration also gets step-by-step visibility in this release, which cuts a real debugging pain point. Opaque parallel agent execution is where hours disappear when something goes wrong.

Dreaming requires an access request—it's not generally available. Outcomes and multi-agent orchestration are in public beta.

Verdict: Evaluate. If you're already on Managed Agents, test outcomes now—the success criteria reframing is a one-time conceptual lift that pays off in reduced prompt iteration cycles. Request dreaming access if you have agents running across sessions. Don't migrate to Managed Agents solely for this release.

Anthropic releases Claude Fable 5 model widely

Fable 5 is Anthropic's highest-capability public model, positioned as the replacement for Opus 4.8 on long-horizon reasoning and complex code tasks. Pricing roughly doubles from Opus 4.8. The noteworthy implementation detail: domain-specific safeguards on cybersecurity and biology queries fall back to Opus 4.8 on approximately 5% of requests.

That fallback mechanic is the thing to test before committing. A 95% success rate sounds high until you're running a pipeline at scale—1-in-20 requests silently degrading to a different model is a determinism problem, not a capability problem. You need to know which queries trigger fallback, how to detect it in responses, and whether your use case lands in the affected domains.

For pure capability on tasks that don't touch the fallback domains, Fable 5 is materially stronger than Opus 4.8. The pricing increase is real and needs evaluation against your actual workload—cost-sensitive pipelines with high request volume should model this carefully before switching.

Verdict: Evaluate. If you're on Anthropic's API doing long-horizon reasoning or complex code generation outside the restricted domains, run a side-by-side benchmark now. If you're in cybersecurity or biology tooling, map the fallback behavior before touching production.

Google releases open DiffusionGemma model via NVIDIA

DiffusionGemma-26B is Apache 2 licensed, hosted on NVIDIA NIM, and benchmarks at 500+ tokens per second. No local setup required to start testing—NVIDIA NIM currently offers free tier access.

The Apache 2 license is the headline for production use cases. Closed diffusion APIs carry licensing friction that blocks certain deployment contexts; this removes that constraint. The throughput numbers are compelling for token-heavy multimodal workflows, though NIM's free tier quota limits and latency SLAs under production load are unknowns you'll need to measure yourself.

Verdict: Evaluate. Worth running throughput benchmarks now against your actual workload shapes. Production readiness depends on quota behavior you can only discover through testing. Don't replace a working closed API integration until you've measured latency under realistic concurrency.

Agent failures hide in cache, prompts, defaults

Anthropics's incident review is the most operationally useful piece of writing this week. The finding: context management errors, prompt constraint changes, and parameter defaults silently degrade multi-turn agent behavior without producing crashes or obvious errors. Agents forget decision rationale, repeat completed work, and drift from task—and none of this shows up in clean-environment tests.

The practical framework that comes out of this is a tiered context management strategy: preserve decision rationale and task intent, compress intermediate observations, drop formatting helpers. The point isn't just which content to keep—it's recognizing that reasoning history is working memory, and treating it as garbage to optimize away is how you get silent production degradation.

The process recommendations are equally important: production soak periods for prompt changes, ablation testing per model, employee dogfooding before release. These aren't soft suggestions—they're the gap between catching degradation in staging versus discovering it through user complaints.

Verdict: Ship. If you run multi-turn agents in production, implement tiered context management and the testing process changes now. The failure modes are well-characterized and the mitigations are concrete. This is the kind of hard-won operational knowledge that's worth acting on immediately.

uv 0.11.13 fixes hash validation and editable builds

Two production-blocking bugs fixed: hash requirement enforcement with pylock.toml files now works correctly, and data files are properly included in editable installs. The hash pinning fix matters for supply chain integrity—broken --require-hashes support on pylock.toml silently defeated reproducible builds. The editable install fix unblocks local development for packages with non-Python assets.

Verdict: Ship. Drop-in upgrade, no breaking changes. If you use pylock.toml with --require-hashes or editable installs with data files, upgrade now. Everyone else should upgrade on their normal cadence.

If this breakdown saved you an hour of reading, Dev Signal lands in your inbox every week with the same coverage—no hype, just what senior engineers actually need to make tooling decisions. Worth subscribing if you'd rather spend that hour building.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:19

↗

This week's tooling news splits cleanly between performance and compliance: a Go Protobuf parser that closes the gap between reflection and generated code, and a GitLab update that finally makes air-gapped AI deployments practical. Layered in are a forced AWS migration, a...

This week's tooling news splits cleanly between performance and compliance: a Go Protobuf parser that closes the gap between reflection and generated code, and a GitLab update that finally makes air-gapped AI deployments practical. Layered in are a forced AWS migration, a cost-pressure move in reasoning model pricing, and an Elasticsearch alternative picking up serious enterprise backing. Here's what's worth your attention.

hyperpb Dynamic Parser Matches Generated Code Speed

hyperpb is a runtime-compiled Protobuf parser for Go. You feed it a schema at startup, it runs an optimization pass, and the result is a compiled message type you can reuse across requests. Benchmarks show 10x faster parsing than dynamicpb and roughly 3x faster than hand-written generated code.

The implication for generic Protobuf services—brokers, validators, schema registries—is significant. If you're doing broker-side validation today with dynamicpb, you're likely throttling throughput or skipping validation under load. hyperpb removes that tradeoff. The catch is that compiled types require caching (the optimization pass is slow and should not run per-request) and field access remains reflection-only—you're not getting struct field ergonomics.

Verdict: Ship. If your validation pipeline is hitting dynamicpb throughput limits, this is a drop-in replacement for the hot path. Cache your compiled message types at initialization, and profile field access patterns before assuming it fits your read-heavy workloads.

Quickwit Joins Datadog, Relicenses to Apache 2.0

Quickwit, the Rust-based petabyte-scale log search engine, has been acquired by Datadog and relicensed from AGPL to Apache 2.0. Development continues as open source. Distributed ingest and cardinality aggregations are on the near-term roadmap.

The production credibility is already there—Binance runs 1.6PB/day through it, Mezmo has petabyte-scale logs in production. The Apache 2.0 relicense removes the corporate control concern that kept some operators off AGPL-licensed infrastructure. Datadog's distribution reach will accelerate adoption, but the more relevant signal for operators is that this is now a defensible, cost-efficient Elasticsearch replacement without license risk.

The open questions are around the distributed ingest API (not yet GA) and operational familiarity with the Rust ecosystem for teams coming from the JVM-centric ELK world.

Verdict: Evaluate. If you're indexing more than 100TB/day and paying Elasticsearch costs, start a pilot now. Don't block on distributed ingest GA if your current architecture can stage ingest separately. The core search and indexing path is production-proven.

AWS .NET SDK V3 Reaches End-of-Support

As of June 1, 2026, AWS stops shipping security patches and bug fixes for the V3 .NET SDK. V4 is the only supported path forward.

There's no nuance here. Staying on V3 means running unpatched security vulnerabilities and losing access to new AWS service features as they ship. The migration guide documents breaking changes—the main work is reviewing those, running through your test suite, and executing a staged rollout. The longer you wait, the more this accumulates into a higher-risk cutover under deadline pressure.

Verdict: Ship. Start the migration now. Review the V4 breaking changes, validate in dev, roll out to staging, then production. There is no business case for staying on V3 past June.

GitLab 19.0 Expands Self-Hosted Open Source Model Support

GitLab 19.0 adds support for running Mistral, GLM, Kimi, and MiniMax models on local inference hardware via vLLM in air-gapped deployments. The Duo Agent Platform Self-Hosted add-on enables hybrid setups—you can mix self-hosted models with GitLab-managed models per feature, routing routine tasks to smaller models and complex reasoning to larger ones without sending code outside the network.

This matters specifically for teams under data residency or compliance constraints who have been stuck with a bad tradeoff: either use a cloud-dependent AI setup that exposes code to third-party APIs, or run nothing. The multi-model routing also addresses the previous single-model bottleneck—you can now match model size to task complexity rather than provisioning for worst-case and paying that cost across all workflows.

The prerequisites are real: vLLM serving infrastructure, on-premises GPU hardware (or GPU VMs in a private VPC), and the GitLab Duo Agent Platform Self-Hosted add-on. Contact GitLab sales to validate hardware requirements per model before committing to a GPU procurement.

Verdict: Evaluate. If you're in a regulated environment and have GPU infrastructure available or planned, this is ready now. Hybrid deployment support means you don't need to go fully self-hosted on day one—validate the self-hosted path on one feature first before migrating your full Duo configuration.

Grok 3 Mini API Launches at $0.50 Per Output Token

xAI has opened the Grok 3 mini API at $0.50 per million output tokens, with full reasoning traces exposed via the API. The model targets reasoning workloads and claims competitive performance with frontier models at a price point that undercuts GPT-4o on reasoning parity.

The reasoning trace visibility is the operationally useful part. Explicit chain-of-thought output reduces debugging overhead when a model produces wrong answers on complex tasks—you can inspect where the reasoning broke down rather than treating the model as a black box. On pricing, the claims need validation against your specific workloads before drawing conclusions, but the benchmark it sets will create cost pressure across the reasoning model tier.

Verdict: Evaluate. Worth immediate benchmarking against your current reasoning model spend. Get an X.ai API key, run your representative task distribution through it, and compare cost-per-correct-output rather than cost-per-token. Don't migrate off existing infrastructure based on pricing claims alone—validate against your actual accuracy requirements.

Continue IDE Fixes Multimodel Context and Tool Handling

Continue v1.2.19 patches three specific issues: reasoning-content routing for thinking models (the reasoning_content field was not being mapped correctly), MCP tool argument coercion to schema types (mismatches were silently halting execution), and support for multiple context providers of the same type in config.yaml.

If you're running thinking models like Kimi or Gemini through Continue, the previous version was silently dropping reasoning output. That's not a minor UX issue—it breaks the entire point of using a reasoning model in the workflow. The MCP tool schema fix is similarly critical for anyone chaining OpenAI Adapter calls where argument types weren't matching declared schema.

Verdict: Ship. Upgrade immediately if you're using thinking models or running multiple Ollama contexts in a single config. No migration required—this is a drop-in patch.

If this breakdown saved you time, Dev Signal lands in your inbox every issue with the same format—no fluff, just what changed and what it means for your stack. Subscribe at thedevsignal.com.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:18

↗

This week's tooling landscape is quieter on the AI-native side but dense with infrastructure moves that affect how AI-driven workloads actually run in production. Cloudflare's Workflows scaling overhaul is the clearest signal: agent-triggered execution is now an assumed...

This week's tooling landscape is quieter on the AI-native side but dense with infrastructure moves that affect how AI-driven workloads actually run in production. Cloudflare's Workflows scaling overhaul is the clearest signal: agent-triggered execution is now an assumed pattern, not a novelty, and platforms are rearchitecting accordingly. The rest of the week rounds out with a kernel maintenance drop, a meaningful abstraction removal in tRPC, and a Biome beta that's finally making ESLint replacement feel plausible.

Linux 7.1 Released with Driver and Networking Fixes

7.1 is a maintenance release. No architectural changes, no new subsystems—just patches you should care about if you're running affected hardware or kernel-adjacent tooling.

The two fixes worth flagging are heap overflows in the USB serial io_ti driver (get_manuf_info() and build_i2c_fw_hdr()), plus memory leak corrections scattered across drivers and networking subsystems. Trace tooling also gets updates, which matters if you're doing kernel-level performance analysis on production systems.

One operational note: Torvalds is traveling, so merge window latency may be irregular. If you're tracking pull request timelines for custom kernel builds, plan for slippage.

Verdict: Ship — if you're on 7.0 and running USB serial hardware or affected networking paths, upgrade on your normal kernel cycle. No breaking changes, no new dependencies, nothing to validate beyond your existing regression suite.

tRPC Drops Abstraction Layer for React Query

This is the kind of change that looks small in a changelog and feels large in daily development. The new tRPC client exposes native TanStack Query interfaces—QueryOptions and MutationOptions—directly, rather than wrapping them in tRPC-specific hooks.

The practical effect: if you're already using TanStack Query elsewhere in your app, you stop context-switching between two similar-but-different mental models. You call .queryOptions() and .mutationOptions() factories and pass the results straight into useQuery and useMutation. Same patterns, no tRPC-specific hook API to memorize.

There's also a concrete bug fix baked in: the classic client has a hooks-linting issue that breaks under React Compiler. If you're running or evaluating React Compiler, the new client unblocks you.

The classic integration isn't going away—it's still maintained—but it won't get new features. Migration isn't forced, and both clients coexist, so you can move incrementally rather than doing a big-bang refactor.

Verdict: Ship for new projects. For existing codebases, evaluate the migration scope and move incrementally. The abstraction removal is genuinely worth it; don't let the refactor cost stop you from planning it.

Tantivy 0.24 Adds Regex Phrases, Cardinality Aggregation

If you're building search in Rust, Tantivy 0.24 ships two features that previously required workarounds: RegexPhraseQuery for permissive phrase matching, and HyperLogLog++ cardinality aggregation for distinct-count estimates at scale.

Beyond the feature additions, the production stability fixes are the more urgent reason to upgrade. A u32→usize bitpacker overflow was silently crashing merges on multivalued indices larger than 4GB—a failure mode that only surfaces at scale and is genuinely hard to debug after the fact. That's patched. There's also a 45% memory reduction in top_hits aggregation and fixed merge crashes for large multivalued columns.

The only breaking change is the removal of index sorting, which the project flags as likely unused in most setups. If you've explicitly configured index sorting, audit that before upgrading.

Verdict: Ship — drop-in upgrade for existing Tantivy users. The merge crash fix alone justifies it if you're running multivalued indices of any significant size.

Workflows Scales to 50k Concurrent Instances

This is the week's most consequential infrastructure change for developers building agent systems. Cloudflare rearchitected the Workflows control plane—replacing the single Account Durable Object bottleneck with two new components, SousChef and Gatekeeper—to scale concurrent instances from 4,500 to 50,000 and instance creation rate from 100 to 300 per second.

The framing here matters: the explicit motivation is agent-driven workloads. Human-triggered workflows top out at hundreds. Agent-triggered workflows, where a single session can spawn dozens of concurrent instances at machine speed, need a different ceiling. The old architecture hit that ceiling; this one doesn't.

The migration is live and backward compatible. Zero code changes required. If you're already on Workflows, you got the capacity increase automatically.

Verdict: Ship — or more precisely, it's already shipped for you. If you're evaluating Cloudflare Workflows for persistent agent loops, the previous hard limits were a legitimate objection. They're no longer the constraint they were.

Same-Origin Policy Foundations Shape Web Security

This isn't a tool release—it's reference material, and it's worth treating seriously rather than skimming.

The core model: origin is scheme + host + port. Cross-origin resource loading permits script execution but blocks read access. The leak vectors come from side effects—window.length reads, navigation via location.replace, cache timing—not from direct data access. These are the mechanisms behind cache-poisoning, CSRF, and cross-site script inclusion vulnerabilities.

Where this bites senior engineers: iframe and popup interactions, postMessage implementations that don't validate origin strictly, and CORS configurations that are permissive in ways that aren't obviously dangerous until they are.

Verdict: Evaluate — specifically, use this as an audit checklist. Run your cross-origin postMessage calls and CORS configs against the documented corner cases. If you're embedding third-party scripts or building anything with iframes, the mental model here should be explicit, not assumed.

Biome 2.0 Beta Adds Plugins, Multi-File Linting

Biome 2.0 beta is the most serious challenge to the ESLint + typescript-eslint stack yet. GritQL-based plugins, domain-aware rule grouping, and cross-file analysis arrive together—and critically, type-aware rules like noFloatingPromises are now supported without the typescript-eslint setup overhead.

Automatic domain detection (React, Next.js) reduces configuration friction meaningfully. If you've spent time wiring up ESLint rule sets for a React project, you know how much of that is boilerplate. Biome's approach cuts it.

The honest caveat: multi-file project scanning adds latency, and in large repos the performance regression is real. The team is aware and working on scanner optimization, but that work hasn't landed yet.

Setup requires npm install --save-exact @biomejs/biome@beta and pre-release IDE extensions. That's a real dependency risk for anything customer-facing.

Verdict: Evaluate on non-critical or greenfield projects now. Wait for the performance optimization pass before adopting in large monorepos. The direction is right; the beta caveat is genuine.

If this breakdown is useful, Dev Signal publishes it every week across AI tooling, infrastructure, and the developer libraries actually worth tracking. Subscribe at thedevsignal.com and you'll have the distilled version in your inbox before you'd find it anywhere else.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 18:18

↗

This week landed a mix of maintenance you can't skip and concepts worth understanding before they bite you in production. The Continue plugin fixes address real crash vectors that have been silently tanking IDE sessions, while a quietly alarming paper shows that KV cache...

This week landed a mix of maintenance you can't skip and concepts worth understanding before they bite you in production. The Continue plugin fixes address real crash vectors that have been silently tanking IDE sessions, while a quietly alarming paper shows that KV cache quantization is eroding model safety alignment in ways standard evals completely miss.

Continue IDE plugins fix stability, security issues

v1.2.20 patches memory leaks, unhandled exceptions, and JCEF message chunking crashes across both the JetBrains and VS Code adapters. The fixes specifically target the sync layer between Continue's core process and the IDE host—the part responsible for sidebar hangs and autocomplete failures that are notoriously hard to trace back to a root cause.

If you're running v1.2.19 on either IDE, you've likely hit these intermittently and blamed your machine or your project setup. The disposed browser guard fix in particular closes a crash vector that triggers under normal usage patterns, not edge cases.

Verdict: Ship. Drop-in upgrade, no config changes required. Install it now.

Terminal internals zine explains shell, TTY, escape codes

This is a structured walkthrough of the four-layer terminal stack: shell, emulator, programs, and TTY driver. The practical payoff is understanding which layer owns which problem—why arrow keys print ^[[A in one shell but work fine in another, why readline history doesn't persist across sessions, why colour codes bleed across output.

Most terminal debugging happens by trial and error because engineers treat the stack as a black box. Once you have the mental model, you can read strace output, configure readline deliberately, and stop copy-pasting .inputrc snippets without knowing what they do.

Verdict: Evaluate. This is reference material, not a tool. Budget 1–2 hours. Worth it if you SSH into remote environments regularly, maintain dotfiles, or debug terminal weirdness more than once a month. Start with the escape codes and readline sections—the TTY driver layer can wait.

TypeScript 5.9 beta fixes issue query

TypeScript 5.9-beta is on npm with 211 commits since the beta tag. The headline fix is issue query resolution, but the more relevant reason to care is that stable is coming—and if you maintain TypeScript-dependent tooling, CI, or build pipelines, you want to surface regressions now rather than when 5.9 lands and your users hit them first.

The pattern here is straightforward: add a parallel test matrix entry pointing at typescript@beta, run your existing suite, and track failures. You're not looking for new features yet; you're looking for anything that breaks silently.

Verdict: Evaluate. Install in an isolated dev or CI environment, not production. If you own TypeScript tooling that others depend on, this is the right time to test. Everyone else can wait for stable.

KV cache quantization silently breaks model safety alignment

This one deserves careful attention. The paper's finding is precise: safety-relevant representations occupy a low-dimensional subspace that is 10²–10³× more sensitive to quantization noise than general perplexity metrics can detect. The practical consequence is Mistral-7B losing 15.2% of refusals under FP8 KV cache quantization at a perplexity cost so small your standard evals won't flag it.

Per-Channel Reduction (PCR) is the proposed diagnostic—it classifies failure modes mechanistically rather than measuring aggregate perplexity, and recovers up to 97% of alignment behavior with 35 GPU-minutes of calibration using 20 prompts. It validates on independent model families and production quantizers including KIVI, and it's training-free.

If you're running vLLM with FP8 quantization in production and serving a model with safety requirements, you have a measurement gap right now. Your evals are probably not catching this.

Verdict: Ship the diagnostic. Integrate PCR at your quantization step before your next deployment if you're running FP8 KV cache on a safety-sensitive model. The calibration cost is negligible. The cost of not running it is invisible until it isn't.

Claude tool use follows request-execute-return loop

Anthropic's tool use pattern is simpler than most implementations make it look: define tools as JSON schemas, parse tool_use blocks from responses, execute the corresponding functions, return results in tool_result blocks, and repeat until you get end_turn. The loop is explicit and synchronous from the API's perspective—Claude tells you what to run, you run it, you report back.

The critical control point is schema definition. Loose schemas produce ambiguous tool calls that are hard to handle reliably at scale. Tight schemas with well-constrained parameter types give you predictable execution paths. The pattern is stable, documented, and has working Python and TypeScript examples in Anthropic's docs.

Verdict: Ship. If you're building Claude integrations with any multi-step logic and you're not using the native tool use pattern, you're writing orchestration boilerplate that this replaces. The implementation overhead is low and the reliability gain for agent workflows is real.

Fable 5 executes complex tasks autonomously for hours

Fable 5 is positioned for long-horizon autonomous execution—Stripe reportedly ran a 50M-line codebase migration in a single day. At $10/$50 per million tokens, it's in practical range for engineering workloads that previously required multi-week sprint allocations. The architecture supports file-based memory patterns that let it maintain context across multi-hour runs without hitting context window limits.

The integration caveat is non-trivial: when Fable 5 hits queries flagged by its safety filters, it silently falls back to Opus 4.8. There's no error, no flag in the response, just degraded capability. If your workload touches anything in the cybersecurity domain—penetration testing tooling, vulnerability analysis, security research—you need explicit detection logic for this fallback, or you'll get inconsistent results you can't easily diagnose.

Verdict: Ship for most workloads, evaluate for security-sensitive ones. Replace Claude Opus 4.6 for long-horizon coding and analysis tasks now. Build fallback detection before deploying anything that touches restricted query categories—silent capability degradation is a production reliability issue, not just a policy concern.

If this kind of technically grounded coverage of AI developer tooling is useful to you, Dev Signal goes out every week at thedevsignal.com. It's written for engineers who need to make real decisions about what to adopt, not marketing copy dressed up as analysis.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 16:17

↗

The Architecture of Synthesis: Why Interdisciplinary Engineering is the New Standard For years, we’ve been obsessed with the "T-shaped" developer—the person with deep expertise in one stack and a shallow understanding of everything else. It was a safe bet for hiring managers....

The Architecture of Synthesis: Why Interdisciplinary Engineering is the New Standard

For years, we’ve been obsessed with the "T-shaped" developer—the person with deep expertise in one stack and a shallow understanding of everything else. It was a safe bet for hiring managers. But the landscape has shifted. Today, the most valuable engineers aren't just deep-stack masters; they are practitioners of interdisciplinary synthesis.

If you're still siloed into "Frontend" or "Backend" or "ML," you’re missing the architectural patterns that emerge at the intersection of domains.

The Problem with Domain Silos

We often treat software as a closed system. We build APIs, we write unit tests, we deploy to K8s, and we call it a day. But real-world systems—whether it’s a high-frequency trading platform or a generative AI pipeline—don't respect these boundaries.

When you start blending disciplines—say, combining Cognitive Science with Human-Computer Interaction (HCI) or Control Theory with Distributed Systems—you stop writing "code" and start building "mechanisms."

A Case Study: The Feedback Loop

Think about rate limiting. A junior dev sees it as a simple middleware function. An engineer thinking across disciplines sees it as a control theory problem: PID controllers, dampening factors, and system equilibrium.

// A naive approach
if (requestCount > limit) throw new Error("429 Too Many Requests");

// The interdisciplinary approach: Control Theory-inspired adaptive limiting
class AdaptiveRateLimiter {
  private threshold: number = 100;
  private ema: number = 0; // Exponential Moving Average of latency

  public shouldAllow(latency: number): boolean {
    // Adjust threshold dynamically based on system health (latency)
    this.ema = (0.1 * latency) + (0.9 * this.ema);

    if (this.ema > 200) { // System is stressed
      this.threshold = Math.max(10, this.threshold * 0.9);
      return false;
    }

    this.threshold = Math.min(1000, this.threshold * 1.05);
    return true;
  }
}

Why Multidisciplinary Thinking Wins

The code above isn't just "better"; it’s the result of applying engineering principles from outside the software bubble. When you study systems design alongside biology or economics, your code becomes more resilient. You start anticipating failure modes that don't show up in a standard unit test suite.

The challenge, of course, is the "translation layer." How do you take a concept from a completely different field and make it production-ready?

This is where the structure of your documentation and your knowledge architecture matters. You need a blueprint that maps high-level theoretical concepts to low-level implementation. It’s not just about learning new things; it’s about having a framework to organize that synthesis.

If you are looking for a way to structure these complex, cross-pollinated ideas into a coherent digital presence or a technical documentation system, I highly recommend looking at the approach outlined in مطالعات میان رشته ای. It offers a clean, localized framework for managing the intersection of diverse information architectures, which is exactly the kind of rigor we need when we move beyond standard software engineering.

Stop Coding in a Vacuum

The best engineers I know are obsessed with things that have nothing to do with programming. They read about urban planning to understand microservices. They study linguistics to write better domain-driven designs. They look at architectural patterns in history to build better databases.

If your growth strategy involves only learning the latest version of a framework, you’re on a treadmill. The framework will be deprecated in three years. The ability to synthesize knowledge across disciplines? That’s the only thing that’s truly future-proof.

Start looking at the edges of your domain. That’s where the real problems—and the most elegant solutions—are hiding.

DEV Community dev.to community dev-to software-dev technology 2026-06-18 16:17

↗

If you build on top of LLMs, you've probably hit this: you ship a feature, traffic spikes, and the API bill comes back way higher than you expected. Per-token pricing makes costs hard to predict — you're billed by how verbose the model is, not by the value you ship. I got...

If you build on top of LLMs, you've probably hit this: you ship a feature, traffic spikes, and the API bill comes back way higher than you expected. Per-token pricing makes costs hard to predict — you're billed by how verbose the model is, not by the value you ship.

I got tired of that (plus juggling three API keys), so here's a setup that fixes both: one OpenAI-compatible endpoint that auto-picks the best model and charges a flat price per call.

The core idea

Instead of calling each provider directly, you point your existing OpenAI SDK at a single gateway and send one model name: modelis-auto. It routes each request to the best model for the task (GPT-5.5, Claude Opus 4.8, Gemini 3.1, Grok, DeepSeek…) and bills a flat per-call rate — so your cost is predictable regardless of which model handled it.

Zero migration: just change base_url

If you already use the OpenAI SDK, this is a one-line change.

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_MODELIS_KEY",
    base_url="https://modelishub.com/v1",   # the only change
)

resp = client.chat.completions.create(
    model="modelis-auto",                    # let it pick the best model
    messages=[{"role": "user", "content": "Explain CRDTs in two sentences."}],
)
print(resp.choices[0].message.content)

Or with curl:

curl https://modelishub.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_MODELIS_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"modelis-auto","messages":[{"role":"user","content":"Hi"}]}'

That's it. Your existing code, SDKs, and OpenAI-compatible tools keep working.

"But which model actually answered?"

Fair question — auto-routing shouldn't be a black box. Every response returns a header telling you exactly which model handled the request:

X-Modelis-Routed-Model: claude-opus-4-8

And if you want control, you can stay in a quality tier or call a specific model directly:

model: "modelis-auto:premium"     # stay in a quality tier
model: "gpt-5.5"                   # or pin a specific model

Why flat per-call instead of per-token

The point isn't "cheaper than everyone" — it's predictable. With a flat per-call price:

A verbose response doesn't cost more than a terse one.
A busy day scales with calls, not with token noise.
You can actually budget, and price your own product with confidence.

Honest take: when per-token is still fine

If your workload is steady, you control prompt/response sizes tightly, and you've already optimized model choice per route, per-token billing can be cheaper. Flat per-call shines when traffic is bursty, prompts vary, or you just don't want to babysit model selection and cost. Pick what fits your reality.

Try it

There's a free tier: modelishub.com. I'd genuinely love feedback — especially whether predictable pricing actually matters for how you build, or if you prefer per-token control.

How to implement field-level AES-256-GCM encryption in Spring Boot (and why we packaged it into one annotation)

Why GCM, and not just AES-CBC

Building it by hand

Wiring it into JPA

The one-annotation version

When you don't need any of this

DuckDB 1.4.5 LTS, pgEdge ColdFront Beta, and SQLite's FCNTL_PDB Internals

DuckDB 1.4.5 LTS, pgEdge ColdFront Beta, and SQLite's FCNTL_PDB Internals

Today's Highlights

Announcing DuckDB 1.4.5 LTS (Andium) (DuckDB Blog)

Introducing ColdFront: Seamlessly Uniting OLTP, Analytics and AI Workloads on PostgreSQL (Planet PostgreSQL)

Why does SQLITE_FCNTL_PDB exist? (SQLite Forum)

Pointers and Tuning and Loops! Oh My!

Introduction

Loop Refresher

Pointers

restrict Revisited

Explicit Caching

Conclusion

Epilogue

I gave my AI workers a cited knowledgebase so they'd stop guessing

The job I was actually hiring this to do

What I built

The one rule that mattered most

What this is not

The honest status, as always

May You Get What You Asked For

May You Get What You Asked For

They're Not Human 😱

They're...Really Not Human 😱

What Are Companies Doing?

Time Spent

The First Microprocessor Was Built for a Calculator

A calculator contract that got out of hand

2,300 transistors that started everything

Why this matters for IoT today

The lesson for builders in the Philippines

AI Observability for Lovable Apps: Monitor, Test, and Improve Prompts with Currai

AI Observability for Lovable Apps: Monitor Prompts, Traces, and Evaluations with Currai

What is Currai?

The Problem With AI Applications

Trace Every AI Request

Run Prompt A/B Tests

Evaluate Prompt Quality

Understand Usage and Costs

Example: Building a World Cup 2026 Prediction App with Lovable

Why AI Observability Matters

Getting Started with Currai

Demo Video

Learn More

Most AI dev tools assume you have a repo. Ops engineers have a broken node and a 3am page.

What is HiveTalk?

Tired of maintaining a compose file for local and a whole other toolchain for prod? I wrote about composing your environment from a catalog of services and deploying it with one tool, from docker compose up to production.

One config from `docker compose up` to production

Fix 'SharedArrayBuffer is not defined': a practical guide to cross-origin isolation

TL;DR

Why the browser blocks SharedArrayBuffer

The mistake almost everyone makes: CORP is not COEP

Check whether you're actually isolated

Setting the headers

The side effect to plan for

Verify any URL without writing code

Recap

Your coding agent will route around your rules. Here's how to actually stop it.

Why blocking commands doesn't work

How I set this up with Faramesh

Install

Declare the policy and the proxy port

Start Faramesh

Point Claude Code at the proxy

How the workaround dies

Start in shadow mode if you want to ease in

The one thing worth taking from this even if you never touch Faramesh

🚀 Hermes Agent Just Released a Desktop App And It Changes Everything About Using AI Agents

🎥 Full video walkthrough

🤔 The Biggest Problem With AI Agents

🧠 What Is Hermes Agent?

✨ The Desktop App Changes Everything

📂 Session Management

🔍 Watch Your Agent Work

`restrict` Revisited