SCOTT: Okay. Check, check. I'll try to just hold it and not fidget around too much. So my name is Scott and I'm going to talk about the Burrows-Wheeler transform. The Burrows-Wheeler transform was first published in 1994.
AUDIENCE MEMBER: A little bit to your side. Don't hold it in front of your mouth.
SCOTT: So it was originally published in 1994, in a paper called "A Block-Sorting Lossless Data Compression Algorithm." So let's unpack that. Lossless means that if you put data into it, you get the same data out of it. It restructures the data so that patterns fall out that make it compress better. And it's used for data compression because in most cases the output compresses significantly better after passing through it. Data compression is all about finding repeated patterns that can be represented more succinctly. So what it does is take text -- it can be applied to binary data too, but it generally works out better for text -- and sorts it. It's not quite as good as fully sorting it. I mean, if you look at fully sorted text, you can look at that and say: a bunch of F's, a bunch of G's, some spaces in there, got it. And that kind of thing can be compressed down really well. But we can get pretty close to that. So here's the abstract of that paper, and here is the abstract after being run through the transform. Some things stand out right away, like that first line there is almost entirely T's, the next line after that is mostly A's, and the third line is almost entirely L's. What it's doing is sorting the text, but it's sorting each letter by the letters that occur after it, which is kind of strange, but what ultimately happens is that it groups letters by the context in which they appear. So you get a lot of cases where, if you sort all of the H's together by what follows them, a lot of them happen to be preceded by T's -- because of "that," and "the," and "then," and "there," and "this," and that sort of thing. It will be broken up by a couple of other letters, like the capital W in "Wheeler," but mostly you'll have runs of the same letter, because the same combinations of letters come up over and over in text. And when I say "block," this is usually applied to tens of kilobytes at once. You end up with stuff like that, which the English majors in the room might recognize as Hamlet. For the complete example of how this actually works, I'm going to use the word "bananas," because it's one of the most repetitive words around and it somehow ends up as the example in every project. So take the word "bananas," then take all the rotations of the letters in "bananas" and lay them out in a matrix like that, and then sort all of those lines lexicographically. Then take the last column -- ultimately you can throw the rest out; that's all we need. And we also save the starting position, the row where the original string ended up, because we'll need it to recover the text later. So, yay! We have some scrambled text, which is all well and good, but the really, really clever thing here is that this whole thing is reversible. But how? How would you recover the original? What do we still know here? We have that last column; we've discarded everything else. But we know that the first column -- my slides are not advancing -- we know that the first column is the same as the last column, except that it's sorted. So we can copy that and put it back, and we know that each row wraps around, so the last letter of a row is immediately followed by the first letter. So we have that bit of context to stitch it back together.
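A minimal sketch of that forward transform in Python, written out naively with explicit rotations rather than anything from the talk's repository:

```python
def bwt_naive(text):
    """Naive Burrows-Wheeler transform: build every rotation, sort them,
    keep the last column plus the row where the original string landed."""
    if len(text) < 2:
        return text, 0
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    start = rotations.index(text)                     # starting position we save for the inverse
    last_column = "".join(row[-1] for row in rotations)
    return last_column, start

print(bwt_naive("bananas"))   # ('bnnsaaa', 3) -- the scrambled text and the saved row
```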
And if you look at the data really closely, you'll see that the occurrences of each letter appear in the same order in those two columns. So the first A, the second A, and the third A in each column actually refer to the same A -- it's not just that they're both A's, they're literally the same A in the text. Which means that, if you work through all of that, that's all we need, so we're going to derive the text back out of it. To start out, we need to number the occurrences of each letter in both columns, and they pair up like that. We know that we start at the B, because we saved that starting position in the original stage. We wrap around from the B to the first A, then we find the first A in the right column, and we save the B. Then we wrap around from that A in the right column to the N in the first column and save that, and then the A, and the N, and the A, and then the S. And we have "bananas" again. So, what else is it good for? Well, I've already talked about using it for better compression, and if any of you have used bzip2 before, that's built on the Burrows-Wheeler transform. It needs a bit more memory to work in and a little bit more CPU, but it's noticeably better at compressing text than gzip. It also pairs well with other transformations, like move-to-front encoding, which replaces each byte in a byte stream with a number saying how recently that byte was last seen. The Burrows-Wheeler transform output ends up with a lot of patterns like this: runs that have a lot of characters in them but are drawn from the same small set of characters. These patterns kind of rhyme even though they're not the same, and if you pass them through move-to-front encoding they end up exactly the same, and once you take that into account, it all squashes down really, really nicely. Also, because the Burrows-Wheeler transform output ties the original text to a sorted version of the text, it's good as a search index, and there's a structure called the FM-index -- still a bit of an open research area -- that builds on it. So let's implement it. You can do the transform in about ten lines of Python, which fits on a slide that you can't read; it's on GitHub, there's the URL there. I'll break it out here. To start, you take a string; if the string is zero or one characters, just leave it alone -- there's no point in shuffling a deck of one card. Then, instead of materializing all the different rotations of the same string, work with positions: you can save a lot of memory by just keeping the offsets and deriving the rest when you need it. To compare any two of those positions, you compare the text sliced at each position, wrapping around if need be, and sort using that comparison. Then find the row where the original string ended up -- in this example it's the row starting with the B, near the top, but typically it will be somewhere in the middle. And then return the last column, built by slicing the text at all of those rearranged positions, along with that starting position. And there you go. We can do the inverse in about 25 lines of Python, which you really can't read -- I can't read it up here either, but again, it'll be online. Likewise, we take the string and that starting position, which is crucial; otherwise we'd end up somewhere in the middle and just loop around. If it's zero or one characters, we'll just leave it alone, because we can't meaningfully do anything with it. Otherwise, create an empty dictionary.
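Here's a rough sketch of the whole inverse along the lines described next -- a paraphrase of the approach in plain Python, not the actual 25 lines from the repository:

```python
def inverse_bwt(last_column, start):
    """Undo the transform: number each letter's occurrences, sort a copy to
    recover the first column, then hop between matching occurrences."""
    if len(last_column) < 2:
        return last_column
    counts = {}                                # occurrence counter per letter
    last = []                                  # e.g. ('b', 1), ('n', 1), ('n', 2), ...
    for ch in last_column:
        counts[ch] = counts.get(ch, 0) + 1
        last.append((ch, counts[ch]))
    first = sorted(last)                       # the first column is the last column, sorted

    result = []
    row = start                                # the saved starting position
    for _ in range(len(last_column)):
        result.append(first[row][0])           # emit this row's first-column letter...
        row = last.index(first[row])           # ...then hop to the row whose last column holds
                                               #    the same occurrence of that letter (linear search)
    return "".join(result)

print(inverse_bwt("bnnsaaa", 3))               # 'bananas'
```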
We're going to use that dictionary to number the different occurrences of the letters, so you can say: okay, this is the first N, the second N, the third A, et cetera, et cetera. So we number all of those letters, then make a copy and sort it, so that we have the first column and the last column with the occurrence numbers attached. We pair those together, and then here's a function that, given a letter and its occurrence count, will find the corresponding one in the other column. This just uses linear search -- it checks row by row by row -- which is fine for this example but can be made a lot faster. Then we start with an empty list, collect the letters into it, and step through the pairs one by one, one by one. So this works pretty well, but there's room for improvement. The transform spends most of its time in that sort, and there are a bunch of things that can speed it up. In particular, there's a data structure called a suffix array, which takes advantage of knowing that everything being sorted is a rotation of the same string, and it can make that sort a lot faster. And the inverse can be sped up in a lot of ways having to do with that look-up: whether you use more memory for a look-up table, or a small amount of memory for tracking the counts for each letter, so there's a lot of room for time-versus-space trade-offs. There are also other, slightly different approaches to implementing this. Sometimes you'll see a special end-of-string marker; that's not necessary, but it sometimes makes one side simpler at the cost of a little more working space on the other. Okay. Thanks.
JULIA: Thank you very much. Next up, we have -- this is where I look at the schedule -- "Eternal Sunshine of the Spotless Search Index" by Fiona Condon. Who's somewhere...? There. And this is, like, one of my favorite abstracts because -- you should come up here and get set up; I'm going to ramble for 30 seconds. I love it because it starts with: search engines are designed for memory, but sometimes you need to forget.
FIONA: Thanks. I'm going to try to put it on the stand. How does this sound? Am I okay? Sorry. This is the setup. So I want... okay. Cool. Just turn on the thing. Okay. Sorry. Hi there. I'm Fiona, I'm a search engineer, and I'm going to talk to you about search engines and nail polish. So let me explain a little bit about how those are connected. I used to have a nail art problem. This photo is from the peak of my nail art phase, when I had something like a hundred Pinterest posts about my nails. I was spending too much time on my nails and too much money on nail polish, and my cuticles were destroyed. I had my browser block my subreddit. But I also had a diary, with entries like this -- tips and tricks and stuff about nail polish that would just remind me how much I like it -- and I would stumble upon them accidentally on purpose when I was missing my hobby. But I didn't want to delete them entirely, for posterity. And so I wrote a search engine in Python called Acetone. The entries are stored as plaintext files in directories, and it doesn't use anything like Solr or Elasticsearch. And my goal is not to impress you with the fact that I made it from scratch, but to show that search engines are really beautiful and simple. So what's a search engine? I put in a text query, and I get documents back.
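For contrast, the brute-force version of "query in, documents out" is just a scan over every file, something like this (the function and arguments are made up for illustration, not part of Acetone):

```python
import os

def brute_force_search(query, directory):
    """No index: read every plaintext file and keep the ones containing every query term."""
    terms = query.lower().split()
    matches = []
    for name in os.listdir(directory):
        with open(os.path.join(directory, name)) as f:
            text = f.read().lower()
        if all(term in text for term in terms):
            matches.append(name)
    return matches
```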
But unlike grep, it doesn't scan every document for every query; it leverages a preprocessing step called indexing to create a file, or a set of files, that make lookups fast. I don't really care about performance for these local plaintext files, but I'll be taking advantage of this preprocessing step to hide some of the nail polish stuff from myself. So imagine that we are indexing some nail polish names. What this preprocessing step actually does is step through every document, look at every term, and create a mapping from the term to the list of document IDs that contain that term. I use this example because nail polish names are both very short and very weird and hilarious -- these are all real nail polish names. So once it's done this and built out these lists, we have a mapping: "about" is in documents one, three, and four, so you see a list of one, three, and four, and same thing for "pink," two and three. Now when I look those up, I see "pink" and I return those two documents. And it's not much harder to handle two terms. For a query like "pink about," we just grab both lists and do a set intersection, and now we have the one document that matches both the words "pink" and "about." Cool. So, the one thing that's not so great about this system is that there are really two documents here that are relevant to the query "pink about": "Pink About It," which we returned, and "Pinking About You," which we didn't. This is kind of a contrived example, but in general you have this problem where something is pluralized in the text but you search for the singular, or vice versa, and you're not going to retrieve it, because we only have these kind of naïve mappings from the original terms. So we've got to do some text normalization. This is called the analysis chain. So, back to my original diary entry. What we've got here is just a big blob of natural language text, and the first step in the analysis chain is called tokenization, and that just takes the big text blob and makes it a stream of words. So now we have a nice stream of individual tokens. The second step is stop words, which drops out very common, less semantically meaningful words, like "is," "too," "I," and that keeps the index size manageable without affecting that many queries. And then finally we do stemming, which does the heavy lifting of text normalization. A stemmer is a set of rules that iteratively strips off suffixes and tries to find the root of the word. So "pinking," which was the problem from the nail polish example, becomes "pink," and now a search for "pink" can find it. This step has to happen both to the tokens in the document, as they go through the indexer, and to the tokens in the query -- otherwise, if I searched for "pinks," it wouldn't match anything, because the indexed term got boiled down to "pink." So we've got this nice chain, and out of it we get this robot-ready text that's ready for indexing. So now I'm going to talk about the modifications I've made to turn this into a safe-search mode for nail polish. Modification number one: I know some words that I shouldn't be searching for. Acetone, topcoat, Essie -- those are nail polish terms, and if I shouldn't be searching for them, I might as well not be indexing them at all. There are already stop words in the analysis chain, so I could just add these to that list. But I think we can actually do one better. Words like nail polish brand names -- Essie, OPI -- are often followed by the full nail polish name, and I shouldn't be searching for that either.
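Before those modifications, the unmodified pipeline described above -- postings map, set intersection, and a simple analysis chain -- might look roughly like this. The stop list, the toy stemmer, and the document IDs are stand-ins for illustration, not Acetone's actual code:

```python
from collections import defaultdict

STOP_WORDS = {"is", "too", "i", "a", "the", "and"}    # illustrative stop list

def tokenize(text):
    # tokenization: big text blob -> stream of lowercase words
    for line in text.splitlines():
        for token in line.split():
            yield token.lower()

def stem(token):
    # a toy stand-in for a real stemmer: strip a few common suffixes
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    # the analysis chain: tokenize, drop stop words, stem
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def build_index(documents):
    # term -> set of document IDs containing that term
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return postings

def search(postings, query):
    # run the same analysis on the query, then intersect the postings sets
    term_sets = [postings.get(term, set()) for term in analyze(query)]
    return set.intersection(*term_sets) if term_sets else set()

docs = {1: "About Last Night", 2: "Pink About It", 3: "Pinking About You", 4: "All About the Base Coat"}
postings = build_index(docs)
print(search(postings, "pink about"))   # {2, 3} -- stemming lets "pinking" match "pink"
```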
So I'm going to intervene at the tokenizing step: if I see something like "acetone," I'm going to drop it, but if I see something like "Essie," I'm also going to drop the next three tokens. This is how that stream gets transformed, and this is the code. This is the original tokenizer, and it's super simple thanks to the generator pattern in Python: I just read in a line, break it up on whitespace, and now I have a nice stream of tokens that I can use in the indexer. What I'm going to mix in here is a blacklist that says, for each word, how many of the following tokens to drop as well. Now the tokenizer is a bit more complex, but it's still fairly simple: if it sees something in the blacklist, it starts a countdown, and it only begins yielding tokens again once the countdown has reached zero. So the original tokenized document looks like this, and after this step it looks like this. We've dropped "topcoat" and "acetone," but we've also dropped the full nail polish name. So this is pretty good, but I still see some terms in there, like "glitter" and "chip," that are not nail polish specific but are pretty nail polish relevant. I don't want to totally remove them from the index, but I do want to make it a little bit harder for myself to find them. And I was really inspired by this Ciara video where she changes her ex's name to "old news" on her phone. I really liked that, so I'm adding a kind of cautionary aliasing step on top of the stemmer. Basically, what I'm going to do is alias terms that I'm moderately confident are about nail polish to rhyming words, so they can only be queried by the alias. So I've got this rhyming scheme, which maps them to terms that remind me not to go looking, without making me too nostalgic. It would rhyme "lacquer" with "whacker," and -- and the key insight is that this time, unlike stemming, which needs to happen both at query time and at index time, I only want to do this transformation at index time. Because if I did it in both places, then I would search "lacquer," it would get transformed to "whacker," it would match those mappings, and I wouldn't have changed the results at all. So the post-stemming document looks like this, and after the aliasing step we've got these aliases -- "litter," "quaint," whatever. So I won't be able to look these things up without remembering the alias. Cool. So this brings me to the final modification. Maybe you've noticed that I haven't even gotten close to the terms "nail" and "polish," which are the most relevant terms. That's because I have documents like this one, which contains the word "nail," and documents like this one, with "polish" -- or "Polish." So what I need is a way to blacklist a phrase and not a word, but I can't really do that when all I have is maps from terms to sets of IDs: I don't know where in the document the terms actually appear. What I really want to be able to do is something like this: when I have a query, I want the results for that query, but not any documents containing the phrase "nail polish." So now I need to revisit the tokenizer again, so that it reports the position in the document. I'm going to yield a tuple of an integer -- the token's position within the document -- and the token itself, and construct a kind of modified postings list where, instead of a term mapping to a list of IDs, I have a term mapping to a dictionary from document ID to the list of positions in that document. So now "enough" shows up at positions one and three, and "is" is at position two.
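A sketch of what that position-aware tokenizer and modified postings list could look like, folding in the blacklist countdown from earlier -- the blacklist contents are made up here, and this is a reconstruction rather than Acetone's actual code:

```python
from collections import defaultdict

# term -> how many of the following tokens to drop along with it (made-up entries)
BLACKLIST = {"acetone": 0, "topcoat": 0, "essie": 3, "opi": 3}

def tokenize_with_positions(text, blacklist=BLACKLIST):
    """Yield (position, token) pairs, skipping blacklisted terms and
    counting down through the tokens that follow them."""
    drop = 0
    position = 1                                # 1-based, to match the example above
    for line in text.splitlines():
        for token in line.split():
            token = token.lower()
            if drop > 0:
                drop -= 1
                continue
            if token in blacklist:
                drop = blacklist[token]
                continue
            yield position, token
            position += 1

def build_positional_index(documents):
    # term -> {document ID: [positions where the term occurs]}
    postings = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, token in tokenize_with_positions(text):
            postings[token][doc_id].append(position)
    return postings

postings = build_positional_index({7: "enough is enough"})
print(dict(postings["enough"]))                 # {7: [1, 3]}
print(dict(postings["is"]))                     # {7: [2]}
```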
And so this complicates the search step a little bit, because doing an intersection of sets was fine, but doing an intersection of dictionaries doesn't make that much sense -- that's just not going to work. I'm not going to linger too much on this because it's somewhat complicated, but with the right abstractions -- iterating through a list and matching a predicate -- it can actually be written pretty simply, and we get a phrase merge that works even when you have more than two terms. And then it works something like this. Well, I'll wait for it to restart, but it used to be that when I searched for "nail," I would see both documents, including the nail-polishy one; with that code added back in, all of a sudden I'm able to limit it to the document that I'm safe to look at. So thank you so much for listening. I'll be around if you have any questions about search engines, or if you want to know my pick for the greatest quick-dry top coat ever. So thank you.
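For reference, here's one way the phrase-exclusion search she describes could be written against that positional postings structure -- a simplified stand-in for Acetone's phrase-merge abstraction, with a hand-built postings dict just for illustration:

```python
def contains_phrase(postings, doc_id, phrase):
    """True if the phrase's terms appear in this document at consecutive positions."""
    position_lists = []
    for term in phrase:
        docs = postings.get(term, {})
        if doc_id not in docs:
            return False
        position_lists.append(docs[doc_id])
    first, rest = position_lists[0], position_lists[1:]
    # does some occurrence of the first term line up with each following term in order?
    return any(
        all(start + offset in positions for offset, positions in enumerate(rest, start=1))
        for start in first
    )

def search_excluding_phrase(postings, query_terms, banned_phrase):
    # intersect the documents containing every query term...
    doc_sets = [set(postings.get(term, {})) for term in query_terms]
    candidates = set.intersection(*doc_sets) if doc_sets else set()
    # ...then drop any document where the banned phrase occurs as consecutive tokens
    return {d for d in candidates if not contains_phrase(postings, d, banned_phrase)}

postings = {
    "nail":   {1: [1], 2: [4]},
    "polish": {2: [5]},            # in document 2, "polish" immediately follows "nail"
    "gun":    {1: [2]},
}
print(search_excluding_phrase(postings, ["nail"], ["nail", "polish"]))   # {1}
```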