SIMON: So hi, everyone. I'm Simon Sapin. I work at Mozilla Research on which is a new rendering engine and it's written in Rust and as you heard, Rust was released as 1.0 just this Friday, and when you open a file in Rust, the way that the file name is represented in memory is a bit unusual and I think it's interesting so I want to talk about that. But let me give you some background on how to deal with text. All CPUs know numbers but don't really know how to deal with text so we need a convention to deal with text with numbers. And of course there's more than one way to do it. And in the early days, most of these encodings include could only support some sets of writing systems. So if I wanted to write Japanese I had to use a Japanese encoding and I could probably not had Korean in the same document. So what can we do with that, of course, we make a new one that supports everything and that's exactly what Unicode did when I was 1-year-old. And they decided to have 16 bit characters which allows up to 65,000, and some characters. And I say "character" here but it's a bit more complicated than that in Unicode. You need unique components of characters and gives that references symbols. And even at the time, there were some concern that 16 bits would not be enough. But the Unicode Community really wanted to keep it simple and even on page two of the first entry of the chapter in the book, because it was a book at the time, they had this diagram that says, here's a bunch of the space that we use for allocated characters, and how much we have left. And they say, we have space for future foreseeable expansion. Guess what happened. So meanwhile, Unicode goes updating many systems. And these systems use two bytes per character because that was the good thing to do. And at the time that was really what we called "Unicode." But later, Unicode became more than just encoding so now we call that particle encoding with two bytes per character, UCS-2. Meal meaningful people didn't like this two bytes per character, and they came up with UTF-8. Each one of the characters is represented by one of these byte characters. And what's interesting is that it's variable by size. So that allows up to 31 bits of space. And in Unicode version two, they finally realized that it's really not good enough for 16 bits and so they introduced this system called UTF-16 which is really like using UCS-2, so much that you don't really have to do anything in your software to use UCS-2. The idea was that there were these range of values that were not used that we now called lead surrogates and trail surrogates and when you have lead followed by trail it's a surrogate pair and together that pair represents a single character. And that goes up to a million and some characters. But the thing is, Unicode version two did not allocate any characters to the new supplementary characters. So there was no incentive to really looking at it. And really even if you do it right, most code doesn't need it either. It's only when you need to run text in different fonts. And the problem is what happens when you have a surrogate that's not in the pair? That doesn't really represent anything. But no one really knows what you should do in that case. So in many cases, nothing happens. You just have something that's not text. And so also in Unicode version two, they generalized the mechanism of having abstract characters assigned to numbers and that's separated from the way that these numbers are encoded into bytes and there are multiple encodings in Unicode, UTF-8, UTF-16, and these other encodings were artificially restricted to encode as values of UTF-16. So even though Unicode can go up to 31 bits, it was restricted to have this range of use. And in particular this range in the middle are used for surrogates that cannot be used in Unicode so when you have this byte sequence in UTF-8 that would have been a character previously but now it's considered ill formed, it's not a valid character. Let's go back to Rust. This is actually a finished string time in the Rust directory. It contains eight bits of assigned integers, and that is private so you have to use the API to access it, and the API enforces that they're in UTF-8. And that's really good when dealing with international text to do things correctly so you don't have emoji, or character encodings arising in your programs but when talking to the outside world, it's not so nice. In particular in the operating system APIs, any strings like filenames, and all that stuff does not do well in Unicode. Unix doesn't really care, it's always bytes but on modern systems it's almost always UTF-8. And in Windows, it was once UTF two once upon a type of. But now it's UTF-16, but not always. You can see sometimes you'll have the filename in your file system is a surrogate, and not a character. We want this to be cross platform so when you apply, you don't have to care about these differences between platforms. So how can we do that and how can we support any kind of filenames, for example? So Rust has this type called OSString that contains any value that the OS can give you or you can give to it, and again it's just bytes. And in Windows, you could have vectors of 16 bit digits which is what happens in the OSCPI's but we would rather have something closer to UTF-8 to have cheaper conversions to and from the native string time. And and we have if you devote a lot of memory to copying, that's also not good. So why use UTF-8 for this? When you have a normal character, it's only two bytes in UTF-16 that's fine to convert into UTF-8. If you have a supplementary character that's surrogate pair, that's essentially fine. The problem is when you have a surrogate that's not in the pair. But again, in the early days of UTF-8, that was capable of encoding that values. So what if we did anyway? So that wouldn't really be UTF-8, so we should really give it another name. And someone came up with that name, which I really like. Someone called it that. And this is a superset of UTF-8 where you can also have surrogates but only if they are paired which means that when you concatenate two strings, in case that you would have lead surrogates at the end of the first and a trail at the beginning of the other one, that would form pair, and we don't want that to happen. So you have to have a special case here to transform that into a single supplementary character. And I think this existed before in Scheme 48 and in Racket, but it was a probably not fine. And so I wrote a spec. It defines everything with proper definition, all the edge cases. And that's the UTF-8. [ Applause ] So I'll be here if you want to talk about this Unicode, Rust, or the things. This is my username.