If you're in the business of building things that run on computers long enough, I think you will eventually acquire a favorite bug story. This is a short story about mine. I've also built an interactive tool where you can explore the concepts underpinning the heart of this bug.

The bug: two emoji enter, none leave

I was working on migrating a legacy editor to a more collaborative experience with my team. TipTap on top (itself a wrapper around ProseMirror), Yjs underneath handling the CRDT magic for real-time syncing. It worked well! Mostly.

In our alpha/early release days, when it was still mostly internal and/or early rollout users, sometimes the editor would just stop saving your content. Silently. You'd keep typing and everything looked fine, but your edits stopped syncing to the Yjs document. The next time you opened the page, everything you'd written after the failure point was gone.

It was utterly terrifying, very rare and almost impossible to diagnose because we could never recreate it. We really tried! My early suspicions generally revolved around shaky wifi connections and wonky websocket behaviors, but no amount of throttling or turning my wifi on and off seemed to recreate the issue. The experience was surprisingly resilient in those scenarios, in my memory. It felt like it happened randomly, never when anyone was looking. No obvious errors picked up in the console, no stack trace, no crash. Just... "Hey, I think my changes didn't save."

Then one day our product manager cracked it. This was not a trivial thing to find. He'd been experiencing it more than anyone else (probably because he was the best at dogfooding our product) and had been methodically narrowing it down. "I feel like I'm going crazy, but I think it's when I type specific characters together, go back and insert a character between them..."

He'd been using 🟢 and 🔴 in his weekly project status emails to communicate general health. Green for on-track, red for at-risk.
Every week the template he was using had both characters already present and he would simply remove the one he didn't need (generally the red one, I am happy to say!). On this occasion he'd copied the green circle and pasted it in front of the red one at some point, or maybe vice versa. That specific operation, inserting one multi-byte emoji adjacent to another, was triggering a splice in the underlying CRDT library, which split a surrogate pair down the middle.

I remember being on the call when he showed this to me and one of my direct reports who'd been toiling away at the collaborative editing transition. I must've gotten a little too excited (I live for esoteric bugs). "I feel like you got energized by this," he said. He wasn't wrong.

Adding to the fun, not every emoji triggered it. Only the ones above U+FFFF that required surrogate pairs. And not all edits resulted in the problem either: only the ones that caused a splice at exactly the wrong byte offset. It was a wild one to debug before we knew what was going on.

Code units, code points, and grapheme clusters

So what was going on? What does "ones above U+FFFF" in that last paragraph even mean? What byte offsets? To understand this bug we need to introduce three pieces of vocabulary:

Code Units → Code Points → Grapheme Clusters

Code units are the raw 16-bit values that JavaScript uses to store strings internally (UTF-16). This is what .length counts. This is what .slice() and .charCodeAt() operate on as well. JavaScript operates at the code unit level by default.

Code points are what Unicode actually defines as a single character. A code point like U+1F920 (🤠) is one character in Unicode's view, but it's too big to fit in a single 16-bit code unit. So UTF-16 splits it into two code units called a surrogate pair: a high surrogate and a low surrogate. Simple ASCII characters and a lot of common symbols fit in one code unit, so the distinction doesn't matter for them. Emoji, though? Almost always two.
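To make the code unit versus code point distinction concrete, here is a minimal sketch using only standard JavaScript string methods. The helper name toSurrogatePair is hypothetical and exists only for illustration; it is not part of any library mentioned in this post.

```javascript
// Assumption: any ES2015+ runtime (Node.js or a modern browser).
const cowboy = "\u{1F920}"; // 🤠, code point U+1F920

// .length counts UTF-16 code units, so one emoji reports 2.
console.log(cowboy.length); // 2

// Spreading a string iterates by code point, so the same emoji counts as 1.
console.log([...cowboy].length); // 1

// charCodeAt() reads individual code units: the high and low surrogate.
console.log(cowboy.charCodeAt(0).toString(16)); // d83e
console.log(cowboy.charCodeAt(1).toString(16)); // dd20

// codePointAt() reassembles the pair into the original code point.
console.log(cowboy.codePointAt(0).toString(16)); // 1f920

// Hypothetical helper: the UTF-16 arithmetic that turns a code point
// above U+FFFF into a surrogate pair. It does not validate its input.
function toSurrogatePair(codePoint) {
  if (codePoint <= 0xffff) return [codePoint]; // fits in one code unit
  const offset = codePoint - 0x10000;          // 20 bits remain
  const high = 0xd800 + (offset >> 10);        // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);       // bottom 10 bits
  return [high, low];
}

const pair = toSurrogatePair(0x1f920).map((unit) => unit.toString(16));
console.log(pair.join(" ")); // d83e dd20
```

The first half of a pair always lands in the range 0xD800 to 0xDBFF and the second half in 0xDC00 to 0xDFFF, which is how a decoder can tell which half it is looking at.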
Grapheme clusters are what a human perceives as a single character. This is what you see when you look at a string in a text editor. It's not necessarily what's stored in memory, though. In the case of our bug, each emoji was a single grapheme cluster (and a single code point), but in memory it was two code units: a surrogate pair. The splice operation was cutting between the two halves of that pair, which is what caused the bug.
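Here is a rough sketch of that failure mode in plain JavaScript. It is not the Yjs code path and not the fix that was eventually shipped; every helper name below (countGraphemes, hasLoneSurrogate, insertAt, snapToCodePointBoundary) is hypothetical and only illustrates what happens when an insert lands between the two halves of a surrogate pair. It assumes a runtime that provides Intl.Segmenter.

```javascript
// Assumption: Intl.Segmenter is available (modern browsers, Node.js 16+).
const status = "\u{1F7E2}\u{1F534}"; // 🟢🔴

// Hypothetical helper: count user-perceived characters.
function countGraphemes(str) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

// Hypothetical helper: true if any surrogate is missing its partner.
function hasLoneSurrogate(str) {
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = str.charCodeAt(i + 1); // NaN past the end of the string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // well-formed pair, skip the low half
        continue;
      }
      return true; // high surrogate without a low one
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) return true; // stray low surrogate
  }
  return false;
}

// Naive insert that works on code unit offsets, like .slice() does.
function insertAt(str, index, text) {
  return str.slice(0, index) + text + str.slice(index);
}

// Hypothetical guard: if the offset falls inside a pair, move it back by one.
function snapToCodePointBoundary(str, index) {
  if (index <= 0 || index >= str.length) return index;
  const prev = str.charCodeAt(index - 1);
  const next = str.charCodeAt(index);
  const splitsPair =
    prev >= 0xd800 && prev <= 0xdbff && next >= 0xdc00 && next <= 0xdfff;
  return splitsPair ? index - 1 : index;
}

console.log(status.length);             // 4 code units
console.log([...status].length);        // 2 code points
console.log(countGraphemes(status));    // 2 grapheme clusters
console.log(countGraphemes("e\u0301")); // 1 (e plus a combining accent)

// Offset 1 sits between the high and low surrogate of the green circle.
const broken = insertAt(status, 1, "!");
console.log(hasLoneSurrogate(broken)); // true

const safe = insertAt(status, snapToCodePointBoundary(status, 1), "!");
console.log(hasLoneSurrogate(safe)); // false
```

Snapping to a code point boundary is enough for this two-emoji case. A sequence that spans several code points, such as a letter followed by a combining accent, would need a grapheme-aware boundary instead.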