>> That was super fun. So next we have Mark Wunsch talking about map, reduce, awk!

>> Give me one second. That was a really cool talk. Fast GPD, too. Okay. Everyone hear me okay? Okay. I'm Mark. I'm going to talk about mapping, and reducing, and doing them both together at the same time with a tool that I really like to use, named awk.

So to start, I have a log file from Amazon S3. And whoa... You see that number? Wow, that's a lot of bytes. That's, like, that many bytes. That's big data, right? I'm like -- wow, for big data, I'm going to need some kind of Hadoop or something like that. Now, I'm not a data scientist, so I'm going to need to brush up on my reading -- all of these books with animals on the cover -- in order to do data science, because that's the well-established vocabulary and tool set for doing big data. And I'm really ready to spin up an Elastic MapReduce cluster, because apparently that's a thing.

But I've been misled. Apparently, 13 megabytes is not big enough to be considered big data by the kinds of people who determine these sorts of things. I've mistakenly assessed the problem. I was in a big data fog, overwhelmed by so many options, and I almost invested in learning this entire pantheon just to figure out how many people downloaded my podcast.

Small data is almost always miscategorized by developers, because we want to try the shiny new technology. And it's easy to understand why, because companies aren't interested in little things, right? Big data is like Big Oil, Big Tobacco, Big Money. Nobody wants small potatoes, small fry, small fish. But it's tempting to think that big data is better. You badly want it to be big. How do you know it's small? Receptors go off in your head to say -- this is too much work. There's a scientific term for this. It's called a sinking feeling. So I want us to escape the psychic prison of big data.

So once you've accepted that your data is small, what tools do you have to analyze it? You can use Python, you can use Ruby, you can use R, maybe MATLAB. But those are all general-purpose tools that are quite large, and they all have their own unique dependency hell. This is small data. Let's stay small. And small data's deliverance is here, with the awk programming language.

So awk is a language designed for analyzing data streams, and it's been mentioned a few times at this conference now. Where did awk come from? These three gentlemen -- Aho, Weinberger, and Kernighan -- named the language after themselves. You might recognize some of those names. They worked for a funky, cool company named Bell Labs. You might recognize them from such hits as UNIX. So in the late 1970s, awk made its big debut in Version 7 UNIX. It's been around a while.

This is what the "A" of awk, Alfred Aho, had to say about it, and we can summarize it like this: it's a language for processing text files, where each line of the input is a record. The record is broken up into a series of fields, and an awk program is made up of pattern-action statements, where if the record matches the pattern, the action is executed. So it looks kind of like this. Pretty simple.

Lowercase awk -- I've been talking about uppercase AWK, the language, so far. Lowercase awk is the interpreter that runs an AWK program. There are many different variations of it, all subtle and unique in their own ways, but we can gloss over that for the purposes of this talk. But here's the big reveal -- everybody in the audience today is going home with an awk!
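A minimal sketch of that pattern-action shape (the pattern and field below are invented for illustration; they aren't from the talk's slides):

    # pattern { action } -- awk runs the action on every record (line) the pattern matches
    /podcast/ { print $1 }    # e.g., print the first whitespace-separated field of lines containing "podcast"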
You get an awk. You get an awk. You get an awk! And that's because you already have one! An awk interpreter is one of the utilities required by the Single UNIX Specification, which means you are pretty much guaranteed to have a version of awk on your system. It's already there. You don't need to worry about Ruby gems or easy_install or CPAN or anything. So let's use it.

When I take a closer look at my S3 logs, this is the Amazon S3 server access log format. It describes 18 fields -- there's a bunch of stuff in there. I'd like to work with something a little bit simpler: the Apache combined log format. That way I can treat my S3 buckets just like they're a web server, and I can also Google for stupid little things I can do with that. So this is the mapping between the S3 fields and the Apache combined log fields.

So let's talk about map. Map in MapReduce is the transformation procedure -- filtering, sorting, that kind of thing. If we look at an awk program where there's a pattern and an action, we're going to ignore the pattern for now, because we want this program to apply to every line of my log file. And a perfectly reasonable action is print. This is a valid awk program. Print, by itself, will just print the input line again. It's like a no-op.

Well, what do we do next? Looking at a line of my Amazon S3 log file here, it's quite long, and we need to break it up into a series of fields. To do that, we're going to use what awk calls the FS variable, or field separator, which defaults to whitespace -- tab and space characters. What's going to happen is that the field separator is going to explode this log line out into fields, and you can see some of the original S3 fields break up into multiple parts. The timestamp, for example, breaks into two separate fields. So we have 21 fields here. And then I can just rearrange these to basically print out the Apache combined log format. And I'm done. I've mapped from one format to another. This is a valid program. It works.

But there's a bug. Anybody spot it? It's right there. So what's the bug? User agent strings are pretty tricky when you're thinking about Amazon web services, web requests, all that stuff. The user agent field is non-deterministic, and awk doesn't know about single quotes or double quotes or what their semantics are. So every time the field separator comes in, it's going to break the user agent up too, and we never have a predictable number of fields per line. That's okay, because awk has for loops. So what we can do is build the user agent back up as a string. I always know that the user agent comes right before the last field of that server log format. So I can loop through and say: as long as i is less than NF, which is the number of fields, keep looping and concatenating -- in awk, a space between two expressions is concatenation -- and build back up my user agent string. So that's it. We did it. It's done. Great.

This is a valid program. I can copy and paste it right into the command line and run it as such, or I can put a little shebang on here and make it executable. Now, what can I do once I have my awk program in this format? Well, then I can do things like change the field separator to be a double quote, pipe the output of that program into this program, and say: for everything that requests feed.xml, print field 6, which is the user agent string.
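A rough sketch of the map program described above, assuming the whitespace-split S3 log line lays out the way the talk describes (21 fields in the simplest case, with the user agent running from field 20 up to, but not including, the last field). The exact field numbers are my reading of the format, so treat them as illustrative rather than exact:

    #!/usr/bin/awk -f
    # Map: rewrite an S3 server access log line as an Apache combined log line.
    {
        # Rebuild the user agent, which can span a variable number of fields;
        # it sits just before the last field (the version id), so stop before NF.
        ua = $20
        for (i = 21; i < NF; i++)
            ua = ua " " $i

        # remote-ip ident authuser [timestamp] "request" status bytes "referrer" "user-agent"
        print $5, "-", $6, $3, $4, $10, $11, $12, $13, $15, $19, ua
    }

And the little follow-up program, with the field separator changed to a double quote so that field 6 is the user agent:

    #!/usr/bin/awk -f
    # Filter: print the user agent of every request for feed.xml.
    BEGIN { FS = "\"" }       # split on double quotes instead of whitespace
    /feed.xml/ { print $6 }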
I can also do this on the command line, with the -F option, which overrides the field separator -- here, to a double quote character. And then what I can do after that is run it through some good old UNIX utilities -- sort, uniq, sort again -- and I end up with a count of each unique user agent that requested my feed.xml, in descending order. Super simple.

This brings us to reduce. The reduce step. The reduce function does summation, accumulation, et cetera. We already got a little sneak peek of that. So here's a program. Field 9, in the Apache combined log format, is the status code, and 206 is a partial content response. You'll see that when you try to request, say, a large mp3 file: the request will say "give me just this range of bytes," and you'll get 206 partial content responses back.

Now, awk has associative arrays. So what I'm going to do next is use the IP address, which is field number 1, as the key, and keep adding the number of bytes downloaded, which is field number 10, to that IP address's entry in the array. And I'm not going to print anything in the main block. Instead, I'm going to wait for the END block, loop through all of the IP addresses in my array, and print each IP address paired with its bytes downloaded. So this gives me the total number of bytes downloaded per unique IP address. I can then pipe that output into one more awk program that just sums it all up, and I get the total number of bytes downloaded from my S3 server for my podcast. From that I can estimate pretty well what my bill from Amazon is going to be. So that's the reduce step. Other things that do the reduce step are these little UNIX utilities: sort, uniq, and wc (word count).

So in this talk, we've learned about map, we've learned about reduce, and we've learned how to do both with a little program called awk. Read the man page if you're interested in more. It's a great utility. I love awk. Thanks very much for listening. (applause)
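For reference, a sketch of the reduce step as described, using the combined log fields named in the talk (field 1 is the client IP, field 9 the status code, field 10 the byte count). Filtering on 206, and the script and file names in the pipelines, are my own illustrative choices:

    #!/usr/bin/awk -f
    # Reduce: sum bytes downloaded per client IP (here, only for 206 responses).
    $9 == 206 { bytes[$1] += $10 }   # associative array keyed by IP address
    END {
        for (ip in bytes)
            print ip, bytes[ip]      # one "ip total-bytes" line per client
    }

    # Chained together on the command line (hypothetical script/file names),
    # with one more awk program that sums the per-IP totals into a grand total:
    ./s3-to-combined.awk access.log \
        | ./bytes-per-ip.awk \
        | awk '{ total += $2 } END { print total }'

    # And the user agent count from earlier, as a single pipeline:
    ./s3-to-combined.awk access.log \
        | awk -F'"' '/feed.xml/ { print $6 }' \
        | sort | uniq -c | sort -rn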