FIONA: Thanks. I'm going to try to put it on the stand. How does this sound? Am I okay? Okay, cool.

Hi there! I'm Fiona, I'm a search engineer, and I'm going to talk to you about search engines and nail polish. Let me explain a little bit about how those are connected. I used to have a nail polish problem. This photo is from the peak of my nail art phase, when I had a hundred Pinterest pins of my nails. I was spending too much time on my nails and too much money on nail polish, and my cuticles were destroyed. So I had my browser block my subreddit. But I also had a diary, with entries like this: tips and tricks and notes about nail polish that would just remind me how much I like it. I would stumble onto them accidentally-on-purpose whenever I was missing my hobby, but I didn't want to delete them entirely, for posterity.

And so I wrote a search engine in Python called Acetone. The entries are stored as plaintext files in directories, and it doesn't use anything like Solr or Elasticsearch. My goal isn't to impress you with the fact that I made it from scratch, but to show you that search engines are really beautiful and simple.

So what's a search engine? I put in a text query, and I get back documents. But unlike grep, it doesn't scan every document for every query; it leverages a preprocessing step called indexing to create a file, or a set of files, that make lookups fast. Now, I don't really care about performance for these little local plaintext files, but I'll be taking advantage of this preprocessing step to hide some of the nail polish content from myself.

So imagine that we're indexing some nail polish names. What this preprocessing step actually does is step through every document, look at every term, and build a mapping from each term to the list of document IDs that contain that term. I use this example because nail polish names are both very short and very weird and hilarious; these are all real nail polish names. Once it's done this and built out these lists, we have a mapping: "about" is in documents one, three, and four, so you see the list one, three, four, and the same for "pink," two and three. Now when I search for "pink," I look that up and return those two documents. And it's not much harder to handle two terms: for a query like "pink about," we just grab both lists and do a set intersection, and now we have the one document that matches both the words "pink" and "about." Cool.

The one thing that's not so great about this system is that there are really two documents here that are relevant to the query "pink about": the one we returned, and one called "Pinking About You," which we didn't. This is kind of a contrived example, but in general you have this problem where something is pluralized in the text but you search for the singular, or vice versa, and you won't retrieve it, because all we have are these naïve mappings of the original terms. So we've got to do some text normalization (there's a sketch of the index so far below).
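A minimal sketch of the inverted index and set-intersection search described above, using stand-in polish names and made-up function names rather than Acetone's actual code:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Intersect the postings sets for every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Stand-in polish names (the real ones are funnier).
docs = {
    1: "mad about hue",
    2: "pink a colada",
    3: "thinking about pink",
    4: "all about the glam",
}
index = build_index(docs)
print(search(index, "pink"))        # {2, 3}
print(search(index, "pink about"))  # {3}
```

Real postings lists are usually kept sorted so the intersection can be done as a linear merge, but Python sets are plenty at diary scale.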
That normalization is called the analysis chain. So back to my original diary entry: what we've got is just a big blob of natural-language text. The first step in the analysis chain is called tokenization, which takes that big text blob and turns it into a stream of words. So now we have a nice stream of individual tokens. The second step is stop words, which drops very common, less semantically meaningful words like "is," "too," and "I"; that keeps the index size manageable without affecting that many queries. And finally we do stemming, which does the heavy lifting of text normalization. A stemmer is a set of rules that iteratively strips off suffixes, trying to find the essence, the root, of the word. So "pinking," which was the problem from the nail polish example, becomes "pink," and now a lookup for "pink" finds it. This step has to happen both to the tokens in the document, as it goes through the indexer, and to the tokens in the query: a query for "pinks" would match as well, because it also gets dumbed down to "pink." So we've got this nice chain, and out comes robot-ready text that's ready for indexing (the whole chain is sketched below).

So now I'm going to talk about the mods I've made to give Acetone a safe-search mode for nail polish. Goal number one: I know some words that I shouldn't be searching. "Acetone," "topcoat," "Essie": those are nail polish terms, and if I'm searching for them, I might as well be looking at the posts. I could just add them as stop words in the analysis chain. But I think we can actually do one better. Words like nail polish brands, like Essie and OPI, are often followed by the full nail polish name, and I shouldn't be searching for that either. So I'm going to intervene at the tokenizing step: if I see something like "acetone," I'm going to drop it, but if I see something like "Essie," I'm also going to drop the next three tokens.

So this is how the stream gets transformed, and this is the code. The original tokenizer is super simple, thanks to the generator pattern in Python: I just read in a line, break it on whitespace, and now I have a nice stream of tokens that I can use in the indexer. What I'm going to mix in here is a blacklist that says how many following tokens to drop for each blacklisted word. The tokenizer gets more complex, but it's still fairly simple: if it sees something in the blacklist, it sets a counter and starts counting down, and it only begins yielding tokens again once the counter reaches zero (also sketched below). So the original tokenized document looked like this, and after this step it looks like this: we've dropped "topcoat" and "acetone," but we've also dropped the full nail polish name.

This is pretty good, but I still see some terms in there, like "glitter" and "chip," that aren't nail-polish-specific but are pretty nail-polish-relevant. I don't want to totally remove them from the index, but I do want to make it a little bit harder for myself to find them. I was really inspired by this Ciara video where she changes her ex's name to "old news" in her phone. I really liked that, so I'm doing something similar with an aliasing step after the stemmer. Basically, I'm going to alias terms that I'm moderately confident are about nail polish to a rhyming sibling, so that they can only be queried by the alias. I've got this rhyming scheme, with replacement terms chosen so they don't make me too nostalgic: it rhymes "lacquer" with "whacker," and so on. And the key insight is that, unlike every other step in the chain, which has to happen both at query time and at index time, I only want to do this transformation at index time. If I did it in both places, then when I searched "lacquer" it would get transformed to "whacker," it would find those mappings, and I wouldn't have changed the results at all.
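Here's a rough sketch of that three-step analysis chain. The stop-word list and the suffix-stripping stemmer are toy stand-ins (real systems use something like the Porter stemmer), and none of this is Acetone's actual source:

```python
STOP_WORDS = {"is", "too", "i", "the", "a", "and", "my"}
SUFFIXES = ["ing", "ed", "s"]  # toy rules; a real stemmer applies rules iteratively

def tokenize(text):
    """Step 1: turn a blob of text into a stream of lowercase words."""
    for word in text.lower().split():
        yield word

def remove_stop_words(tokens):
    """Step 2: drop very common, semantically light words."""
    return (t for t in tokens if t not in STOP_WORDS)

def stem(tokens):
    """Step 3: strip a suffix to approximate the root of each word."""
    for token in tokens:
        for suffix in SUFFIXES:
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                token = token[: -len(suffix)]
                break
        yield token

def analyze(text):
    """The full chain: robot-ready tokens for the indexer (and for queries)."""
    return list(stem(remove_stop_words(tokenize(text))))

print(analyze("Pinking about my pinks"))  # ['pink', 'about', 'pink']
```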
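And a sketch of the counting-down blacklist tokenizer from the first modification, with a hypothetical blacklist:

```python
# Hypothetical blacklist: token -> how many *following* tokens to also drop.
BLACKLIST = {
    "acetone": 0,
    "topcoat": 0,
    "essie": 3,  # brand names are usually followed by the full polish name
    "opi": 3,
}

def tokenize(lines):
    """The original tokenizer: a generator over whitespace-split words."""
    for line in lines:
        for token in line.lower().split():
            yield token

def blacklisted_tokenize(lines):
    """Drop blacklisted tokens, then count down through their successors."""
    drop = 0
    for token in tokenize(lines):
        if drop > 0:
            drop -= 1                # still inside a blacklisted run: skip
        elif token in BLACKLIST:
            drop = BLACKLIST[token]  # skip this token, start the countdown
        else:
            yield token

entry = ["wore essie mad about hue today", "then wiped it off with acetone"]
print(list(blacklisted_tokenize(entry)))
# ['wore', 'today', 'then', 'wiped', 'it', 'off', 'with']
```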
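The aliasing step itself can be as small as a dictionary lookup; the crucial part is wiring it into the indexer's chain only, never the query's. A sketch, with a rhyming table reconstructed from the talk's examples:

```python
# Terms I'm moderately confident are about nail polish, mapped to rhymes.
# (Reconstructed pairs: the talk mentions lacquer/whacker, litter, quaint.)
ALIASES = {
    "lacquer": "whacker",
    "glitter": "litter",
    "paint": "quaint",
}

def alias(tokens):
    """Index-time only: replace nail-polish-adjacent terms with their rhymes.
    Queries skip this step, so only the alias can retrieve these documents."""
    for token in tokens:
        yield ALIASES.get(token, token)

print(list(alias(["red", "glitter", "lacquer"])))  # ['red', 'litter', 'whacker']
```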
So after the aliasing step, the post-stemming document looks like this: we've got these aliases, "litter," "quaint," and so on. I won't be able to look these things up without remembering the alias. Cool.

So this brings me to the final modification. Maybe you've noticed that I haven't gotten anywhere close to the terms "nail" and "polish," which are the most relevant terms of all. That's because I have documents like this, which contain the word "nail," and documents like this, with "polish," or "Polish." What I need is a way to blacklist a phrase, not a word, and I can't really do that when all I have is maps from terms to sets of IDs; I won't know where in the document the terms actually appear. What I really want to be able to say is: give me this query, and not any documents containing the phrase "nail polish."

So I need to revisit the tokenizer again. To track the position in the document, it's going to yield a tuple of an integer, the position within the document, and the token itself. And I'll construct a modified postings list where, instead of a term mapping to a list of IDs, I have a term mapping to a dictionary from document ID to the list of positions in that document. So now "enough" shows up at positions one and three, and "is" is at position two. This complicates the searcher step a little bit, because doing an intersection of sets was fine, but doing an intersection of dictionaries just doesn't make that much sense; that's not going to work. I'm not going to linger on this too much, because it's somewhat complicated, but with the right abstractions, iterating through a list and matching a predicate, this can actually be written pretty simply, and there's a phrase merge for when you have more than two terms (a sketch follows at the end).

And now it works something like this. I'll wait for it to restart, but it used to be that when I searched for "nail," I would see both documents, including the nail-polishy one. After adding that code back in, all of a sudden I'm able to limit it to the document that I'm safe to look at.

So thank you so much for listening. I'll be around if you have any questions about search engines, or if you want to know my pick for the greatest quick-dry topcoat ever. Thank you.
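For reference, here's a rough sketch of that positional index and phrase exclusion. It's simplified relative to the predicate-based merge the talk describes, and the names and documents are made up:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} instead of term -> set of doc IDs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split(), start=1):
            index[term][doc_id].append(position)
    return index

def contains_phrase(index, doc_id, phrase):
    """True if the phrase's words appear at consecutive positions in doc_id."""
    words = phrase.lower().split()

    def positions(word):
        return index.get(word, {}).get(doc_id, [])

    return any(
        all(start + i in positions(w) for i, w in enumerate(words))
        for start in positions(words[0])
    )

def search_excluding_phrase(index, term, phrase):
    """All documents containing `term`, minus any containing the exact phrase."""
    return {
        doc_id
        for doc_id in index.get(term, {})
        if not contains_phrase(index, doc_id, phrase)
    }

docs = {
    1: "hit the nail on the head",
    2: "new nail polish haul",
}
index = build_positional_index(docs)
print(search_excluding_phrase(index, "nail", "nail polish"))  # {1}
```

Keeping per-document positions is exactly what makes phrase queries, and phrase blacklists, possible; the plain set-based index simply doesn't carry enough information.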