>> PAYER: Welcome, everybody. Good morning. I’m
really happy to see at least a couple of people, although, it’s quite early in the morning.
My name is Mathias Payer, and I’m going to talk about how to generate low-overheads dynamic
binary translators. This is joint work. There’s Professor Thomas Gross and the Laboratory
for Software Technology at ETH Zurich. So what it’s all about? Binary translation or
short, BT, is a technique well-known for late transformations. Binary instrumentation, debugging,
profiling, and to add or extend a lot of features on the fly that were not present in the original
program or in the original code. But there are a couple of problems. The flexibility
of dynamics software binary translation incurs at least some runtime overhead and can slow
down the whole program and all that. And the complexity of all the transformations can
be a challenge. So what we try is to offer a high-level interface at compiled time, which
will then be compiled down into effective translation tables. These translation tables
are then used in a low-overheard binary translator to keep the overall translation overhead and
runtime overheard as low as possible so that we can have like a high-level interface with
low-overhead enabling a lot of these late transformations and to be really flexible.
After short motivation, I will give you an introduction. Setting the scene, how binary
translation actually work, what the different binary translation techniques are. I will
discuss design and implementation of a table-based binary translator, talk about the translator
techniques and all that, and the generation of these translations tables. I will then
discuss some optimizations, which enable a fast binary translator in a table-based optimization.
And at the end, I will conclude. So what actually is binary translation? Binary translation
in a nutshell is if you have like an original program on the left-hand side, you apply some
transformation and you’ve got an instrumental program on the right-hand side which will
then be run actually on real hardware. There is static translation for one, which analysis
the original program checks all basic blocks, translates these basic blocks. And in the
end, it meets an instrumented program, which will then be loaded by a ladder and executed
at runtime. So all the instrumentation is already present in the program when it is
started and everything is done statically. The only runtime overhead that we have is
due to the edits, edit instructions or changed instructions. But we see a little cloud over
there. There are some drawbacks to study optimization or study translation. What about self-modifying
code? If the code reads itself and modifies itself or if you have a cheat compiler, we
will never be able to instrument these kind of programs statically because they change
itself and the whole instructions and everything will not be known at translation time. On
the other hand, what about dynamically loaded shared libraries? If we load these libraries,
we add new code to the program, which has not being translated before that. And if you
go more to the molar side, what about obfuscated code? You will never be able to translate
obfuscated code or code that is, for example, available in an image or an exploit or something
like that and static translation will never be able to catch all of these cases. So we
move over to dynamic translation. Dynamic translation starts original program and gives
solution if the original is run. But, in the end, in the back-end, we instrumented the
program on the fly like a cheat compiler does and we have just in time translation of basic
blocks into the instrumented program. We capture all indirect control flow transfers and translate
them on the fly and give the illusion as if the original program would have been run.
And we translate every instruction before it is run. So whenever we see a new instruction
that is about to be executed, we translate it first, cache it, and can add our instrumentation
and all that. So a binary translator looks a little bit like this. We’ve got the original
program on the left-hand side. There’s a couple of basic blocks that are connected through
control flow instructions. There’s a small loop that receive. And then we start the program
under the control of the binary translator. The binary translator uses translation tables
to translate individual instructions and copy them into a code cache. The code cache then
contains the translated version of different basic blocks, which are then interconnected
and optimized. The table generator is an offline part that compiles down into translation tables
which are use every time. These translation tables specify how an individual instruction
is translated and what action should be taken when we encounter such an instruction. So,
I mean, I have the translator that processes individual basic blocks at runtime and copies
them into the code cache. If we have an outgoing etch or a couple of outgoing etches of a basic
block that are not yet translated, we etch trampolines. If three prime would branch to
basic block number four, we would get back into the binary translator and then adds that
basic block and translate it at runtime on the fly. Branch back, do a context switch
into the translator, translate it, copy the translated basic block to the code cache.
So keep up the illusion that the program–the program always test it as execute it in the
original location. If it reads its own machine code, it will read the machine code after
the original program. It never seizes code that is copied into the code cache and the
code that is executed. So keep up that illusion to fix all these, for example, return instruction
pointers on the stack or to be able to execute indirect control flow instructions, we have
to go to the mapping table between locations and the original program and locations in
our code cache. These two locations are translated using a mapping table that can be accessed
at runtime too. So, so much about the basic overview, I will now discuss the prototype
for a translation system like that that’s called fastBT. This is a dynamic binary translation
system. The system itself is actually machine independent, operating system independent.
But the focus of this talk is IA32 and the Linux operating system. IA32 to be able to
access and translate code of a platform that’s most commonly used and Linux because it’s
open source, we can access the ladders and change the whole program like that. So for
an efficient binary translation, we need to be able to specify these translations and
instrumentations. There are a couple of approaches which we’ll use; hard-code translation tables
based on individual instructions. This is very hard for a programmer. But it’s good
for the binary translation system. The binary translation system uses these translation
tables that describe individual instructions to translate them from the original program
into the code cache. But, as I already said, the manual construction of these translation
tables is really hard. A programmer has to encode individual machine code instruction,
has to know the whole Intel instruction manuals by hand and all that, and it’s really error
prone and really hard. So what we propose is automation and a high level description.
We have translation tables in the back and a table generator. We’ve got a high level
interface and a C++ library where you can specify high level transformations on the
machine code itself. These high level transformations are then compiled down into translation tables,
which will then be used at runtime. So we offer a high level API at compiled time to
select individual changes of the machine code and compile these Intel code tables down into
optimized translator tables, which are used of the table-based binary translator later
on. What’s actually possible to offer for such a compiled time high level interface?
We can offer different transformations based on this opcode tables that kept–catch memory
accesses. Does this instruction access memory? What are the source destination or auxiliary
parameters? Is it a floating point instruction and sse instruction? What kind of opcode is
it? Is it an arithmetic opcode? Is it a [INDISTINCT]? What kind of immediate values are there and
so on and so on. They’re able to select and specify these instrumentations as compiled
time and we’ll then generate the opcodes and translation tables that are used at runtime.
So that is all at compiled time. We have the high level interface, and we’re going to now
move over to the runtime interface. The translator uses an iterator based approach. Whenever
we execute code that we have not yet translated, we started iterator and iterates for every
single instruction. We decode the whole instruction. Also, if it’s like multiple bytes long and
execute a specified action for that individual instruction that is specified in the earlier
translation table that has been generated. So this iterator based approach iterates through
basic blocks, copies them into the code cache and then transfers the control back to the
translator code. To achieve low-overheads, we have to master several fundamentals. One
of them is we have to use a code cache for a translated code so that we can–we have
to retranslate every time that we execute it. We have to use inlining for functions
to reduce the overhead of function calls. Well, the function call itself does not incur
too much overhead, but the return instruction is an indirect control flow like any other.
And we to master all other indirect control flow transfers to achieve low-overhead because
these are the ones we cannot fix. We have to do the online runtime lookup. Whenever
you have an indirect control flow transfer as a, for example, an indirect jump, an indirect
call, or a function return, we have to go through the mapping table, do a lookup and
then transfer the control to the translated location in the code cache. This has a lot
of overheads, but we cannot, like, change the jump locations in the code cache because
otherwise we would like to reveal the code cache and the translator code and we don’t
want to do that. So to master these indirect control flow transfers, we have several optimizations.
Usually, we have the runtime lookup and the patching that is required. And these individual
indirect control flow transfers are then replaced by a software trap, which branches into the
binary translator, executes the whole lookup routine and transfers the control back to
the translated program. This is expensive. So we have several optimizations that are
done in fastBT to reduce that overhead and to recover from these overheads. For one,
there is local branch prediction which predicts the location where the indirect counter flow
transfer will go to. There’s inlining of a fast lookup into the code cache which will
check for the most common positive. We have a direct hit in the mapping table and we can
build on the fly shadow jump tables. And I will now discuss these optimizations in a
little bit more detail. The local branch prediction that we use caches the last one or two targets
in the code cache. It emits specified code sequence which caches two or one targets that
has been used before. If you have a cache hits in that target, we can direct–we redirect
the control flow to the correctly translated entry in the code cache. If you have a cache
miss, we look the target up in the code cache, we fix the location and we update the cache,
the local cache. These results in like three to five instructions if we have a cache hit
and a couple of more instructions if you have a cache miss. This optimization works pretty
good if we have like a very low change rate of these indirect counter flow targets and
a limited set of targets. The second optimization is the fast lookup. We emit a fast inline
lookup into the code cache, which checks the first location of which uses the target, does
some fast cache and looks it up in the mapping table. If it is a hit, we can directly transfer
the control to the translated instruction. If it’s a miss, we transfer back into the
binary translator, but if this optimization is more complex and cost us about 13 to 14
instructions, which is more expensive than, for example, the prediction. The next optimization
that we have is a shadow jump table; a lot of the overhead that we measure, for example,
in the TCC benchmarks and all that comes from shadow jump or from jump tables. TCC tends
to generate a lot of jump tables, for example, for switch statement and initializes these
jump tables. The predictor will be pretty bad. They’re pretty bad job because there
are a lot of targets and we have a lot of cache misses. If we now detect a jump table
that is used in memory, we built a shadow jump table with translated targets. Check
the translated target, branch to the translated target, in fact, if it’s the correct one.
Otherwise, we branch into a backup function. These results in a very low-overhead of only
five instructions if the target is already translated and we have a shadow jump table
in place. Similar to the branch prediction but it was a whole jump table that is in the
back. So the problem with all these optimization is that each optimization is only effective
for some locations and a very specific program behavior. If you have a low number of changes
and only few changes–if you have a low number of targets and only few changes, use a cache.
This will be pretty fast. We do the lookup the first time and can then reuse the cache
all the time. If you have a high number of targets and many changes, we have to fall
back to a fast lookup that’s the fastest possible to handle these cases. If the location has
many different targets and the targets don’t change and all are closed to each other, they
most presumably have a jump table that’s why we use a shadow jump table. So, but it’s really
hard to select these optimizations at compile time. If you see an indirect control flow
transfer, for example, an indirect call or an indirect jump, we don’t know if it’s–if
you should use a prediction, if you should use the fast lookup, or if you should use
the shadow table. So that’s not known at translation time only at runtime. An adaptive optimization
on the other hand can select the best optimization for each individual control transfer. That’s
why we have an adaptive optimization that is used for all these indirect control flow
transfer, which direct this prediction for one or two locations. Count the number of
misses and mis-predictions that we have for that optimization. If we have a specific threshold
or reach a specific threshold, we recover to a fast lookup and re-writes the automatically
generated codes that is in the code cache or we construct the shadow jump table if the
control transfer uses this jump table as well. So these adaptive optimizations are then able
to bring competitive performance to a table-based translator. To show you some numbers, I will
now discuss the set up of the benchmarks. We use null-translation to show the translation
overhead only. We use the SPEC CPU2006 benchmarks to evaluate the performance of our binary
translator. And we use the Test dataset to show translation overhead for short running
programs of between two and 40 seconds. And we showed–use the reference dataset to show
translation overhead for long running programs. This is up to a couple or up to a thousand
seconds on our machine. Related work that we compare ourselves, so this is HDTrans,
another table based translator which, where the user has to supply translation tables
by hand and hard code all these translation actions. DynamoRIO, an IR based binary translator
which translates machine code into an intermediate representation does some optimizations on
that intermediate representation and translates that back into machine code. And PIN a tool
that also uses an IR based approach for–to generate machine code and translate its–between
the different formats. So, to show you the numbers for the reference dataset, I selected
the four best performing benchmarks so that you can see the individual overheads. We compare
fastBT, HDTrans, PIN and dynamoRIO. The–we showed you overheads compared to an untranslated
run. On the right hand side you see the overhead of–the average overhead overall specs CPU
benchmarks. We can see that table based translators, kind of, have a problem if you compare them
to dynamoRIO. For some benchmarks, the IR based approach is able to outperform, outperform
the table based approach in some cases. But if we check the average overheads, this is
low for both table based translators and IR based translators. If you look specifically
at the perlbench and a few benchmark. We see that the problem of the table based binary
translators is the huge number of function calls. We have for, each of these benchmarks,
we have more than 20 billion function calls that cannot be inlined in our optimizations.
But, nevertheless, an overhead of 50% is still pretty low for the perlbenchmark. This is
an interpreter that’s running on other program which is then interpret it in a binary translator.
So the IR based approach will, of course, have some performance edge on that one. And
we see in the analysis here that there are more than 20 billion function calls where
we can still or where we could still do a little bit better. But, nevertheless, on the
average overheads if you compare these–the table based iterators show the same performance
than the IR based ones. If you check indirect jumps, on the other hand, we see that we have
the jump table optimization and the predictor. This two are both together are able to cover
about 99% or more than 99% of all the indirect counter flow transfers that are there for
indirect jumps. So we won’t be able to do much better for indirect jumps, no matter
what optimization we use. If we go to the indirect calls, we see that for xalan and
deall we have almost or close to a 100% prediction rate where we use the prediction optimization.
But for perlbench and gobmk we still have–the prediction doesn’t work so good and we have
to fall back to the fast lookup in the end. Now if we go to the Test dataset for short
running programs, on the other hand, we see that table based iterators have a performance
edge compared to IR based iterators. These are programs that execute between four and
between below one and 40 seconds. We see that most IR based approaches are never able to
recover from the translation overhead they incurred. So, in the end, an IR based approach
is able to keep, is not able to keep the low-overhead that it had for long running programs and
table based approach will be able to outperform this IR based approaches. So if we compare
the reference and the Test dataset for our binary translator. We see that for some benchmarks,
for example, for perlbench, we see that the Test dataset is even faster than the reference
dataset. This is due to different instructions that are executed and a smaller set of code
that is executed. So we don’t incur as much runtime overhead due to the translation as
if we would run the reference dataset. So to summarize, we have a higher overhead if
we have a lot of indirect counter flow transfers. Function calls incur some overhead even with
our optimizations. Indirect flow transfers without cache or jump tables add additional
overhead to these translations. Or if we have high collision rates and then suboptimal mapping
table then you have to recover to–then we have to issue expensive recoveries and try
different scheduling strategies. On the other hand, if you have low-overhead, if you have
few indirect control transfers and the cost of the indirect control flow transfers can
be reduced through the optimizations. So to conclude, fastBT, the flow that’s available
shows that it is possible to combine ease of use with efficient translation. We offer
a high level interface at compiled time, compiled down to fast translation tables. These translation
tables are then used at runtime. And at runtime, we showed that we have comparable overheads
to IR based approaches for long running benchmarks, comparable average overheads to long running
benchmarks and we have a performance edge for short running programs up to 40 seconds
and other benchmarks. The adaptive optimizations that we’ve shown select the best optimization
based on an individual location. Compared to earlier strategies there–we have to select
an optimization based on a class of counter flow transfers. We now can select the best
optimization based on each location of an indirect counter flow transfer and this gives
us a performance edge. Adaptive optimizations that I’ve shown you here are necessary for
low-overheads in the table based binary translators. And the table based translator, fastBT, is
available as open source. You can download it from the webpage. You can use it in your
projects. You can specify your transformations. And with that, I would like to thank you for
your attention and I’ll be happy to answer any questions that you have.
>>I have a couple of questions.>>PAYER: Yeah.
>>First of all have you considered doing static jump table recovery? You can get over
99% of indirect branch targets just by static analysis of at least SPEC 2006 benchmarks.
>>PAYER: Yeah, that’s actually one of the targets we would do if we were a company.
The best optimization that you can have is–would be to use a combination of static translation
and stanimic translation. So you can keep up the whole illusion of running the program
in the original location. But you already did a lot of the static translation have it
in the back.>>I mean you could create the shadow jump
table ahead of time because you know what all the targets are.
>>PAYER: Yeah. Yeah.>>And it’s really easy to build all those.
Second question was–yeah, you go ahead.>>PAYER: But the problem is, the problem
is that, that’s–actually, if you would, if this would be released as–if there will be
more people working on it, we would have already done it.
>>Okay.>>PAYER: Right. That’s one thing. The other
thing is the construction of the jump tables or the cost to construct these jump tables
that is really, really low.>>That’s about it. Yeah.
>>PAYER: Even at runtime. We have almost negligible runtime overheads for the translation.
It’s the translation itself is just a lookup in the different tables and a copy of the
machine code instruction to the code cache, so this is really cheap. And if you construct
a jump table, we allocate the page for that, add some specific charts, initialize that
page with a lookup or like backup targets and then when the target is executed first
time, we patch it back. And this is all like, very low-overhand and this won’t changed the
big picture.>>Okay, the second question, it looks like
your performance numbers were really close to HDTrans and dynamoRIO so, tell me, why
am I going to pick fastBT over those guys?>>PAYER: It’s faster than dynamoRIO if you
go to short running benchmarks.>>Uh-huh.
>>PAYER: That’s for one thing.>>Okay.
>>PAYER: It’s faster than HDTrans in most cases. HDTrans doesn’t really offer the complete
set of instructions so it’s only for a limited subset of specific of instructions.
>>Okay.>>PAYER: It doesn’t perform. It isn’t able
to execute all the sse instructions, all the extensions and all that. And for HDTrans,
you have to specify the transformations in machine code. So you specify what machine
code is translated into which machine code.>>Okay.
>>PAYER: And you’re not able to use the high level API. So, fastBT offers PIN-like API,
with the speed of low level implementations. So you combine the good things of both worlds.
So we have, like very low but you can use high level API to sell, like your transformations.
>>Okay. If there are no more questions, I’ve got a third. So, since you’re using this table
based API, it sounds like it would be a lot easier to add a target so let’s say a 664
ARM or something like that. How much work do you think it would be to implement another
target?>>PAYER: As a translator for ARM for example?
>>Right, right.>>PAYER: Now you would need to specify the
translation tables…>>Right, so how much work do you think the
translation tables are?>>PAYER: Actually, the tables, the opcode
tables are just a copy from the IA32 instruct, or from the Intel instruction manuals.
>>Okay.>>PAYER: You just have to write all these
tables by hand and bring them into format that is parsible by the table generator.
>>Okay.>>PAYER: And, of course, you will have to
adapt table generator a little bit to the particularities of the ARM CPU for example.
>>All right. Thanks.>>PAYER: Okay.
>>So, if you look at this in a multithreaded environment, do you need to like stop the
world every time you update the code cache or like do the translation? Because otherwise
another thread could be running there.>>PAYER: Yeah, actually, that’s another difference
to some of the other purchase. We have thread-local, code caches and thread-local transformations.
So we differentiate between each thread and we catch a piece thread create an old user
fork system calls that would change the layout or the number of the threads and all that
and we keep the local. So each thread has its own thread local code cache. This enables
some other speed ups. So, the code cache will only contain code that is local to that thread
and if you have a locker thread, we will generate code for that local thread only. If you have
worker thread that code cache will only contain the code for the worker methods. And this
kind of forms a greedy really trace extraction where we generate traces off code that are
executed on the fly, and our thread-local caches will then contain these whole traces
in there. This enables even some speed up in some cases and if we look at the paper,
we show that we have, for some SPEC benchmarks; we have up to 3.5% performance improvements
even this, all the binary translations running in the background.
>>Just a quick question. So what’s the difference between reference and Test datasets? I was
confused about it.>>PAYER: SPEC CPU offers many kind of runtime
flex and you can select which ones you want to. There’s the Test datasets just to test
your implementation which will only run for like between–below one second and up to 40
seconds. It’s like a limited–the SPEC CPU benchmark set consists of many different program
kernels and they have different sizes of input datasets. The Test dataset is a lot smaller
than the reference dataset. So the reference dataset just runs for a longer time.
>>I see, I see, and I mean in the graph you show actually bigger time for a reference
like much bigger, why is actually? I…>>PAYER: For?
>>Yes, actually the next one, the next slide.>>PAYER: Yes.
>>The next slide. Yes, in here, so I see like the no BTs–sorry, sorry that’s not the
one.>>PAYER: That’s the Test dataset, the runtime
of the Test dataset is a lot smaller, like it’s shorter to run the test program.
>>Actually, this is where I got confused. So you say that test contains many more but
it’s shorter in the same time? It’s like, I mean which one…
>>PAYER: The Test dataset like you have the same set of programs. The Test dataset is
just a smaller input set.>>So I should try…
>>PAYER: Smaller inputting.>>…the reference more, right?
>>PAYER: Yes, sure.>>Because it’s bigger.
>>PAYER: Reference runs longer.>>I see, and in the graphs, you guys performing
much better in the shorter one actually, right?>>PAYER: Yes, because the IR based approaches–if
you run for a thousand seconds, you have to imagine yourself there are IR based approaches
which take the machine code, translate it into a high level intermediate representation
then do some optimizations based on the intermediate representation and then emit machine code.
These IR based approaches will be able to recompile the code all the time if it’s run,
if it runs for a very long time. But on the other hand, if it just runs for a small amount
of time, this IR based solutions will not be able to recover from the initial translation
overhead that they had.>>I see. Okay, thank you.
>>PAYER: Welcome.>>A quick question. So, have you heard of
Strata and have you compared your approach? Strata, it’s a…
>>PAYER: Strata?>>…binary translator…
>>PAYER: Yes.>>From Jack Davidson and Bruce Childers from–and
they’ve done a lot of work with–and, the main contribution of their work is that it’s
extremely low-overhead. So I was wondering if–I think they’ve gotten like 4% overhead
on a complete run of spec. So I was wondering if you compared your approach with Strata
and…>>PAYER: I think the program isn’t–or like
the binary translator is not available openly. So I was not able to compare myself with that
one.>>Right. Yeah, yeah. It’s not–right. You
kind of have…>>PAYER: Okay.
>>…to email Jack and nag him a bit and then maybe in a few weeks you can get it but…
>>PAYER: Yeah.>>Yes, just you know.
>>PAYER: This translator is open source, it’s GPL. You can use it in any projects you
want, and it’s kind of easier to use and compare with such projects with these close ones.
And the [INDISTINCT], it was a couple of years ago, right?
>>Right, right. That was, a good a while ago around the early 2000s. But I’m, cool.
Yeah.>>PAYER: Yeah, I checked his work, I read
the papers but I was not able to compare myself to him.
>>Okay.>>PAYER: Or reproduce his numbers or, yeah.
That’s…>>So what happens when the source instructions
set is not the same as the target instruction set?
>>PAYER: Then, of course, you would need to specify the translations from the source
to the target instructions that–you could add–the, how it works is like we’ve got a
huge table. For each individual instruction, we’ve got an actual function that translates
that instruction. And we have the whole; IA32 specification in the background that we can
use. You would then need to specify a mapping between the instructions and then you could
translate them.>>So the question was have you done that
and if you have, how do these numbers compare?>>PAYER: I haven’t done that but I’ve talked
to a couple of guys about it who are interested in it, and I don’t know how the numbers would
look like. Currently, it’s an IA32 to IA32 translator which specifies on low translation
overhead, low runtime overhead to add profiling, instrumentation, add debugging information
and all that. You can, for example, you could check all memory allocations, check the use,
usage of different finders, discard pages, map pages, and other option would be to like
encapsulates the running program and guard it against overflows, against the exploits
and all that. Check system calls if they correspond to a specific parameter set and all that.
But I have not have not done cross platform transformation but it’s, yeah, I’m talking
about through some guys. Are there other questions? Okay, thanks again for the attention. And
you’re happy and free to download the binary translator. Try it out and write me if you
have run into problems and I couldn’t mind to be able to help you. Thanks.