>> So hi. I'm David Turner, and I love
programming. Recently I was hacking on
git. And I wanted to sort a list of
paths. There was a constraint. The paths
had to be sorted topologically. So that
foo comes before foo/bar comes before
foo/bar/baz. And nothing comes between
the first and the last one. Ascii order
won't work, because dash comes before 
slash, so foo-fleem comes between
foo and the first file inside foo.

This is relatively straightforward to 
solve. We just reorder the alphabet a
bit so slash is second immediately after 
zero, which is used to indicate end of 
string, and everything from one up to 
slash would move up to make room. So 
we can do it with a couple of if/else 
statements. We can also use a lookup 
table, which is slightly faster, but it 
wasn't fast enough. The comparisons were 
a hot spot in my code. So I decided to 
change the code entirely so I didn't 
need to sort.

But what I considered first was to use
the single instruction multiple data
instructions available on modern
processors. This is the crazy fun way to
make code fast. Processors store data
that they're working on in registers. 
These are your processor's version of 
variables. Ordinarily one variable holds 
one value. On a 64 bit processor that's 
a single 64 bit value. But with SIMD 
instructions they can be treated as an 
array of independent values. So for 
example, you can treat 128 bits as four 
32 bit integers or eight 16 bit or 16 
bytes. Each element stays independent. 
So if you're adding two arrays of bytes 
and one byte overflows, it doesn't carry 
into its neighbor. So I can process 
sixteen bytes of my string at a time. 

Because strings are a little bit
complicated, let's walk through an
example of SIMD with pixels. So imagine
we have an image which has 32 bit
pixels, 8 bits each for red, green, and
blue, and 8 bits of padding. And we want
to lighten it: add 40 to each channel.

Heres' the code to do that. Let's 
unpack that. First we'll load four 
pixels into a register to work on them. 
And to read this we'll start with mov 
which means copy data, dq means 16 bytes, 
and a means aligned to 16 byte boundary, 
which is faster. rdi is this ordinary 64 
bit register and by convention it's used 
to pass the first argument to a function.
The parentheses mean to get data
from that address, like a pointer
dereference in C. And this is the first of
our registers. So we pass the point into
this function, dereference it and put
into the function. Easy. We also need
the data to lighten it. That's going to
be 40 twelve times. Our data takes a second
argument which points to this. So we can
do 40 in four red channels, so we're
going to add this lightening data right
in. Right? And again, this is a
complicated chunk here, but to read it
out -- p for packed, add, u for
unsigned, s for saturated, b for bytes.
So saturated means if we hit the top,
right 255, we don't wrap around to zero
as we would ordinarily do. We stay at
255. That's what we want. If our pixel
is as bright as it's going to get,
lightening it won't do anything. Now we
have rsi. Which is the second argument
to the function. Here's the result.
And of course, we need to put that
pixel back into memory. And that's just
the same as the first instruction with
the operands swapped. So we transformed
four pixels in three instructions. Not
bad!

But our string function is more
complicated. It didn't just have
addition. It also had an if statement.
And so we're going to look at a
simplified case here, which is just to
replace all the slashes with ones. So
how does your processor do an if
statement? You do a comparison which
sets some flags and do a jump, a goto,
that's conditional on those flags. But
we can't do that with SIMD, because some
comparisons might be true and others
false. We can't both jump and not jump.
We have to get the effect of if then
else without doing jumps. So we have to
do bit-wise arithmetic. I'm going to
show you this visually. So we take ones
are true and zeroes everywhere else.
Here ones are white and zeroes are
black. Recall that OR giveth and AND
taketh away. AND with zeroes gives you
zeroes. You AND the else value with the
inverse of the mask. And then you use OR
to combine them. Notice that the
operands to the or each has black zeroes
where the other has color. We OR them to
create a complete image. ORing with zero
gives you that back. So think of a
dovetail. 

So let's implement a piece 
of our string function this way. We have 
some register containing 16 bytes of our
string. We want to have a one where
the bytes previously contained a slash.
Let's assume we have a register xmm7
that holds a slash repeated 16 times,
and we have 1 bytes, xmm6, and one 
bits not bytes in xmm5,
and finally have our string in xmm0.
We're going to save a copy of our
string, and we'll do this byte-wise
comparison between our register and the
register with all the slashes. Packed,
compare, for equality, b for bytes. It's
only got two arguments between the
things that it's comparing. Where does
the output go? It's a good thing we made
a copy. Any byte where they're equal
gets filled up with all ones and where
they differ gets filled up with all
zeroes. So we started with the string
foo/bar/example and our mask has ones 
where the slashes were and zeroes
everywhere else. Which is good. Now
we're going to copy that mask. And we'll
build the part of the final value that
contains the ones. Remember that xmm6 
contained the one bytes. So we use AND. 
And now xmm2 has one bytes where the 
slashes were. And now we need to build 
the other part of the final values. We 
need the opposite of that mask. There's 
no SSE NOT instruction. So we need to 
use XOR. XOR with 1 inverts so that's 
what we want. Now we're going to AND 
that with our original string, which 
was xmm0, and now xmm0 has zeroes where 
the slashes were. Right? So you'll 
notice that xmm0 has zeroes, holes, 
exactly where xmm2 has ones. Notches, 
let's say. And that's what we want. 
So we can use OR to put them right 
back together. And now we've reached 
our goal. The slashes were
replaced by ones. We only have seven
instructions to process 16 bytes. 

The one wrinkle in all this is we're using C
style zero terminated strings, so if
we're comparing two strings, we need to
know when either of them ends. And
conveniently, Intel has given us an
instruction that does precisely that.
PCMPISTRI. So that's packed, compare,
implicit, returning index. It can be
explicit if you happen to know the
length of your string. This is an
instruction designed explicitly for the
strcmp function. It's got bells and
whistles that you can use for more
advanced things. So you want to find the
difference, but only if the difference
is before the first zero in the string.
And we want to find if the other string
has a zero in it. Conveniently,
PCMPISTRI does this all for us. Puts the
index of the first real difference into
on ordinary register, by real here, I
mean before a zero. So inside the actual
strings. And then sets some flags. The 
carry flag if there are any real 
differences. And the zero
and sign flags if there are any zeroes
in the arguments. In our case, any of
those flags should be enough to get us
into the routine that tells us which
string is data. If none of them are set
we can move to the next 16 byte block of
our string. So this a little bit
complicated, because we have the values
in the SSE registers but we need to
extract the byte where the first one is.
We have the index of that byte. But
there's not an instruction that gives us
the byte for the index. So after some
searching, I found a thread
where some people figured out how to do
it, with the shuffle instruction.
Unfortunately I'm going to run out of
time to show you the details, but if
you're interested, look up these slides
online and keep watching past the
credits. So this AND/OR thing and the
shuffle trick are the sort of weird
creativity that Intel's wacky
instruction set encourages. So it's a
perversely pleasurable way to program,
because you end up with this
tremendously efficient code, but you
have to really fight for it. And you can
actually get even deeper, when you
consider the timing of instructions. So
what happens when one instruction
depends on data computed by another
instruction? Maybe it has to wait, and
in that gap while it's waiting, you can
start processing the next chunk of
pixels, and now you're interleaving the
two computations and you start to run
out of registers, so you have to be
clever about reusing registers. Thank
you!

(applause)