Welcome to our Friday IACS seminar series. My name's David Sondak. I'm a lecturer in the Institute for Applied Computational Science. Just so everybody knows, we'll be here again next week on the 28th
for a talk on natural language processing. And so just put it on
down in your calendars. Today, we have Nick
Malaya from AMD. Nick is the AMD technical
lead for the Center for Excellence working
on the Frontier exascale supercomputer. Nick got his PhD at
the Oden Institute for Computational
Engineering and Sciences. That’s when we met each
other and became buddies. And before that, he did
his Master of Science in mechanical engineering. I’ll let Nick take
it away from here. So welcome, Nick. Hello. That sound good? Did you turn the– Can you hear me? Is it coming through the mic? Very good. All right. So first of all, I
just wanted to say thank you very
much for having me on your busy Friday afternoon. And so I’ve given
this title here, preparing scientific
applications for exascale computing. And what I’m going
to do today is I’m going to talk through kind
of a few different use cases. I am at AMD, which is industry. My background was academic. I did my PhD in
computational science. And now I work in the
research organization for a hardware company. So you might ask why someone
who did computational fluid dynamics, uncertainty
quantification, things like that ends up at
a hardware company. And hopefully, after this
talk, you’ll see why. So the agenda I have here is
to motivate the importance of scientific computing,
high-performance computing, things I just use
interchangeably. I’ll discuss Frontier,
which is very exciting. This is what I’m working on
from an application standpoint. We expect it to be the largest
computer ever constructed. And then I’ll discuss some of
the scientific applications we will be running on Frontier. We’re not building a big
computer just for fun. That sounds fun, but
we’re actually doing it for a real scientific impact. And then finally, I’ll talk
about some of the themes that I see in the future
of scientific computing, particularly the synthesis
of high-performance computing and what people call
AI or machine learning. So I’ll start with a fun one. This is something
that actually, even when I heard I was
giving this talk, I hadn’t come up with yet,
because it's very timely. Anyone see this press release? Anyone know what this is? Perhaps that's a fun way to put it. And I'm not expecting there to be too many biochemists here. But what this is is S230. It's an antibody. It was known to neutralize the SARS coronavirus, which occurred, I think, in 2003, some time ago. And what this is is a molecular dynamics simulation of that antibody applied
to the novel coronavirus that you may have heard of that
has hit the news so recently. So I think this is a wonderful
example of the importance of scientific computing. This is a computational
simulation. It is not an actual physical examination of the molecule. It's not, for instance, using X-ray crystallography. This is very timely, because as we all know– sorry, I forgot. This is not my work. I'm collaborating with folks
because as we all know– sorry. I forgot. This is not my work. I’m collaborating with folks
at Livermore National Lab. But I just thought it
was a nice motivation. So the virus was identified
on December 31 of last year, at least publicly identified. There’s some debate
about if it was known to the Chinese
authorities earlier than that. But it certainly was first
mentioned on December 31. And yet on February 3
of 2020, this simulation had already been performed. So what’s going on here is that
we have a novel use of HPC. It is a timely thing. There are people who are
dying from this virus today. There is a great
interest in our society to develop antibodies
or vaccines against it. This is done in a
faster way than you could do with experimental
methodologies. It was done on the
Corona cluster, which is an AMD cluster. It has AMD GPUs at Lawrence
Livermore National Lab. It is a remarkable coincidence
that this happened. It wasn’t designed for
it or anything like that. It just happens to
be called Corona, and the coronavirus showed up. It was, in fact, installed far
before the novel coronavirus became prevalent. As I mentioned, it uses a common
open-source package, OpenMM, which is a molecular dynamics
package that I'll come back to later in this talk. Another thing that's
kind of novel here– the antibodies that
worked on SARS, there’s an open question
of, will this also work on the novel coronavirus? They have done some structure. So they have some
sense of the structure. That’s what’s shown in blue. And the question of if the
antibodies will actually bind to it effectively is
what’s being tested here. But in addition to
that, the research team is using machine learning
as a way to sample the parameter space. They're saying, are
there slight differences in the antibody that
would make it bind better? And they’re sampling
this space efficiently and trying to basically
preempt research teams in, what will we do
to develop a vaccine? So pretty timely,
exciting sort of use. High-performance computing, very
much center of the stage here. And this is all just to motivate
that simulations are prevalent now. They’re absolutely pervasive. I don’t think I need to
sell this crowd on it. But every field
of human endeavor in science and technology
is informed by simulation at this time. The Boeing 787, similar to what we were talking about previously, was designed on computers using computational fluid dynamics, materials science, and so on. It was iterated and optimized many more times, to maximize the lift and minimize the drag, than it ever could have been if they had designed it, built it, and validated it only in wind tunnels. It's a key computational tool. It would be very
expensive for Boeing to have built that
many planes, right? They probably should have done
this a little more for the Max, but that’s a different talk. Yeah. Drug discovery,
we just mentioned. Societal impact: global warming. We can't run experiments on global warming, right? We cannot actually build an
experimental lab and do it. This is an area in
which it would otherwise be inaccessible to us unless
we wanted to build a planet, run it for a few million
years, see what happens, right? It’d be too late. I don’t mean to– Quibble. [INAUDIBLE] way. But isn’t part of the
problem with the Max that the software
was good, like, 99.9% of the time, and the 0.1% or whatever of the time that the
software– call it simulation ecosystem– didn’t exactly
mimic the conditions, a bunch of people died? Yep. So I mean, not to be
too glib about it, but software doesn’t always
get everything right. No, software, let me be
clear, is not a panacea. And actually, I think this
does not appear in this talk. I have worked in this field
of uncertainty quantification. I think that’s an
absolutely essential thing. For those of you who are
data scientists who are not working on something where you
worry about a plane crashing, you need UQ, as well. It’s something I will not
jump off into a tangent here, but I completely agree with you. The field of scientific software
is not nearly robust enough and nearly as
rigorous as it should be in terms of establishing
these kinds of credibility intervals and such. So, fair point, but I claim that in the case of something like computer-aided design, it will improve things if
we use it effectively, much more so than designing planes in a
wind tunnel and testing them. I’m not suggesting we should
actually remove that step, either, which some people say. Tangentially, if
you’re a data scientist and you’re saying that we should
remove people from the loop, I also think you’re wrong. I think people are an essential
validation step of it. We’re not going to
remove the human element. I’m not going to take
all your jobs with a GPU. I’m very serious about that. It’s not a Terminator situation. This is augmenting
human thought that way. So it’s a good point. And to that point, also,
feel free to interrupt. Let’s talk. I’m happy to do so. We have some buffer time here. So actually, I’ll just throw one
other one that’s fun in there. The gravitational waves,
which we heard about, was an experimental
apparatus, right? They used these interferometers
that are these enormous ground-based detectors measuring gravitational waves. But the Nobel
Prize-winning work actually involved some of the
computational effort, the post-processing and stuff. So even if you are thinking
of experimental facilities and such, ultimately, you are
still tying it very closely to computation. And these simulations
are often time sensitive. Hurricanes, if we
model a hurricane and it takes us a month
to run the simulation, it’s probably not
enough time for us to evacuate the right places. But there’s a variety of
timescales that matter. There’s throughput. We just want to run
more simulations, because we are optimizing
over a parameter space. Those might only take a minute. But if we actually need
to optimize billions of different models,
a billion minutes, serially, is actually
a very long time. Coffee time– all of
us have done this. You have something running. You’re installing
something on your laptop. And you go up and get a coffee. That’s disruptive, right? It actually is something
that’s an impediment to you doing productive work. Overnight, if it can’t run
overnight, you go home. You run your job. You want it to be
done in the morning. So there’s various
length scales. Mean Time Between Failure, MTBF. If you can’t run a job long
enough before your computer might run out of
batteries, things like that, these are disruptive. All of this is trying
to build to the point that we need more
and more compute. We want to get solutions
faster and faster. No matter how much
compute I claim, no matter how much
compute I give you, you will find a
productive use for it. I’m at a hardware company. I sell more devices
when I do that. So I’m a little
incentivized for this. But I would show
you some data here. I think this is from OpenAI. It’s a very nice plot that
shows the 300,000-fold increase in compute that
has occurred since, if you can see those axes, 2013. HPC system performance doubles as well. It's a power law: it doubles roughly every 1.2 years. That's actually faster than Moore's law, if you look at it, because of the way we've parallelized these systems. But machine learning models have been doubling their compute every 3.5 months.
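As a rough, back-of-the-envelope check of those growth figures (the exact values depend on which version of the plot you read, so treat this as a sketch, not a quote from the slide):

    import math

    # ~300,000-fold growth in ML training compute, doubling every ~3.5 months
    doublings = math.log2(3e5)                 # about 18 doublings
    years_ml = doublings * 3.5 / 12
    print(f"{doublings:.0f} doublings -> ~{years_ml:.1f} years at 3.5 months each")

    # HPC systems doubling roughly every 1.2 years
    growth_per_decade = 2 ** (10 / 1.2)
    print(f"HPC: ~{growth_per_decade:.0f}x per decade")

Roughly five years for the 300,000x jump, which is consistent with the span of that plot, while the HPC curve grows by a few hundred times per decade.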
So again, this talk is about scientific computation. Those of you who are into
AI and machine learning also have a reason to pay
attention, because I believe that everything I tell
you here will trickle down into that very soon. By the way, this plot, when
OpenAI did it, was in 2019. It is now 2020. It doesn’t have Transformer. It doesn’t have ERNIE. Any NLP people here? Those have substantially
more parameters than any of the
models shown here. So I forget AlphaGo Zero had a– doing this off the
top of my head now. But it had less than a billion
parameters than BERT and ERNIE, if you’re familiar with it. There was BERT, which
was the NLP model. And then someone, of course–
it's a very fast-moving field– a week later published ERNIE. Those have billions of parameters. So they're just compute hungry. Some of these machine
learning models take a month to train
on hundreds of GPUs, so substantial amounts of compute. So hopefully, I've whetted your appetite now for saying we all
need more compute. That’s what I’m
here to talk about, the beginning of the
exascale era, exascale here being 10 to the 18. It's an SI prefix. That's a billion billion, which is a handy way to think of it, and FLOPS. So when we say exascale, we really mean FLOPS. And FLOPS are Floating Point Operations Per Second. This tends to be the aggregate number we look at when we measure a supercomputer's capacity to do compute. In a poor man's fashion, you can imagine that a human can do a FLOP or two, one calculation per second. This machine can do more than a billion billion per second.
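To make that concrete, here is the same comparison as a tiny Python sketch, using the 1.5 exaflop peak figure quoted later for Frontier and treating a person as one calculation per second (numbers from the talk, nothing official):

    frontier_flops = 1.5e18              # quoted peak, floating point operations per second
    people = 7.7e9                       # world population, one calculation per second each
    seconds_per_year = 365 * 24 * 3600

    years = frontier_flops / (people * seconds_per_year)
    print(f"~{years:.1f} years of all-of-humanity arithmetic per machine-second")
    # roughly 6 years, matching the fun fact coming up in a moment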
So I was part of the team at AMD. Actually, in addition to being a research scientist, I was the performance lead for our CORAL-2 bid that won this machine. The bid was in partnership with Cray, now part of HPE, Hewlett Packard Enterprise. And this is a grant from the United States government, through Oak Ridge National Lab, to
build this computer, which they are calling Frontier. So some fun facts– this comes
from our friends at Cray. This is the sort of
thing that happens when you give marketing
people information about a big computer. But it’s nevertheless
sort of fun. The network bandwidth
on the Frontier system is 24 million times greater than
your home internet connection. So you could download 100,000
HD movies in one second. There’s that guy in the dorm
who will do that sort of thing, so he could use it. If 7.7 billion people on earth,
the entire human populace, I believe, completed one
calculation per second, it would take six
years to do what Frontier can do in one second. So a lot of compute
capacity here. It’ll cover over
7,300 square feet. That’s almost two
basketball courts. Gives you kind of a
sense of the size. It’s actually not unprecedented
for supercomputer sizes. It’s actually getting
smaller and denser. The reason, actually,
is the speed of light. The closer you can get things to each other, the shorter your latencies. And we hit that a lot. The speed of light is a thing we think about very much when we architect these systems. And it takes 90 miles of cables. There's a lot of interconnect. We want things to be
connected to each other so that they can talk to
each other in very efficient and high-dimensional ways. So it’s the distance from
Philadelphia to New York City. I should've changed that to be from
Boston to some remote location. So I’ve mentioned FLOPS. If you look at the
current top 500, this is the ranking
of supercomputers. I pulled this down. You can see the Nvidia-based computers, from our major competitor: Summit and Sierra are the two top computers there. The axis goes up to about 200,000 teraflops, which is 200 petaflops. These are the two largest
computers in the world. Summit is, by the way, a
computer at Oak Ridge National Lab. And you see a bunch
of other computers. I've called it out all the way down to Stampede2 because that was a computer I
ran on very much during my PhD. Still on the top 15 or so. And so if you were to
project to what Frontier is, you can see where it’s going. This machine will be
delivered in late 2021, which I am very aware is very soon,
because I have a lot of work to do on this. But the reason
I’m showing this– somewhat fun, I’m bragging
a little and such. But actually, the point is
that we are entering a new era, right? This is a different epoch. This is a different scale than
we have encountered before. And when you scale out things,
if you talk to people at Google when they do large-scale data
center installs and such, you hear the same things. When you reach a new scale
you haven't hit before, things break. Bugs get introduced. You learn new things about
where things work and don’t. And that’s what I want to
try and capture more of here in this talk to kind of give you
a sense of where we’re going. And the real reason here is,
also, there’s another one. And I meant to
put this together. I didn’t have a chance to. If you look at the
top 500, well, you see the biggest computer. But if you look at the last
computer, the 500th computer on the top 500, it
tends to be a system that you could buy with a very
reasonable faculty startup grant, startups,
that sort of thing. So the point is, well,
then you time lag it. And you say, how long does it
take for the number one system to basically become the 500th,
accessible to all of us? It’s about 10 years. So I’m giving you a window
on where you will all be, even if you just go and
work at any company, let alone a
large-scale company– where you’ll be in 10
years and the challenges that you will certainly
have to circumvent by then. It moves fast, right? Power laws are a
very good thing. They’re great for
compound interest. They’re good for computing. So Frontier at a glance. Lots of words here. I won’t go through all of that. I’ve mentioned some of it. But it’ll be delivered in 2021. We’re in a little bit
of a race with Intel. Intel also is claiming
they’ll be in 2021, so we’ll see who can win. They’re a major
competitor of ours. They’re building a machine
at Argonne National Lab. The government’s very savvy
about these things, by the way. They play you off
against each other to get better bids,
things like that. If you’re a taxpayer, you
should be happy about that. The peak performance publicly
is stated to be 1.5 exaflops. It will be 100 cabinets. So again, a lot of density. Some machines have actually
about 200 cabinets. So in some sense, it’s a very
dense, compute-heavy machine. And then I wouldn't be here
if it wasn’t, but they will be composed of AMD CPUs and GPUs. CPUs, you’re familiar with. They’re in all of
your computers. You also have GPUs,
Graphical Processing Units. It’s sort of a funny thing if
you've taken linear algebra. You notice that things like translation and reflection, those sorts of linear algebra operations that might be useful in a first-person shooter when you turn the camera, are also the mathematics we use very commonly in science. They're matrix operations. And graphical
processing units, which were constructed for
gaming, have turned out to be extremely effective
for scientific computations for the same reasons. So here’s just a little picture. And I want to just
show you, again, this kind of exponential
increase that we’re seeing. Titan, which was top
of the line in 2012, was a 27-petaflop machine. Summit, now, today, is
the most powerful computer that’s known in the world. It’s 200 petaflops. And then Frontier is 1.5
exaflops, so about 7x. But you see Summit
has 256 cabinets. We’re moving to
100 cabinets-ish. It’ll have one CPU, four GPUs. I’ll talk more about that. And you can see also
the system interconnect. You see it was 6.4 gigabytes per second, going to 25 gigabytes per second, and now 100. If you look at the compute, I said, well, the compute is going up by about 7x. Right? But the network is
only going up by 4. So that might give you a sense
of some of the bottlenecks that are happening. Everything’s growing. Everything’s growing at
a power law, let’s say. But when you have power
laws, the exponent becomes very important. In fact, they diverge, right? One little difference, it grows. This happens in chaos theory. It’s also happening here. So thinking about how
we architect our systems for balance is very critical. And I claim it will have
an impact on algorithms soon. Now, AMD's contribution to it: that was the full system. We are building the
CPUs and the GPUs. I don’t think I need to
pound on this too much more. But it will have our custom
CPU for it, a custom GPU. They're actually
compute engines now. They’re no longer
graphical processing units. You couldn’t play games on them. They’re specialized for compute. We just kind of call
them graphics units because of a historical
anachronism, largely. They’re really more compute
accelerators at this point. So this is more of an
architecture slide. But I’ll just show you. This is a modern server CPU. It has 64 cores, 128 threads,
256 megabytes of cache. The main takeaway from this that you want to see is that this is about two
teraflops of compute. Now I’ll show you the GPU. Both exist today. These are not what
will be in Frontier. But you can guess these
numbers will go up. And here’s a modern AMD GPU. And now, I told you that
the compute of the CPU was about two teraflops. Well, this is about
6.6 teraflops. You already have a substantially
larger amount of compute. And you see, if you’re
interested in machine learning, machine learning is very
amenable to reduced precision. So instead of using
FP64 double precision, they go down to things like
brain float, bfloat16, here. We have FP16; those rates are the same. Already, the FP16 rate is 26.5 teraflops, versus about 2 for the CPU. So it's a substantially larger amount of compute. This is why GPUs are the killer hardware for machine learning today.
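Those peak numbers come from a simple product of unit count, operations per cycle, and clock. A hedged back-of-the-envelope version, where the clocks and widths below are illustrative guesses rather than official specs:

    def peak_tflops(units, flops_per_cycle, clock_ghz):
        return units * flops_per_cycle * clock_ghz / 1e3

    # 64-core server CPU: two 256-bit FMA pipes -> roughly 16 FP64 ops/cycle/core
    cpu_fp64 = peak_tflops(64, 16, 2.0)

    # MI50-class GPU: 3840 stream processors, FMA counted as 2 FP32 ops/cycle,
    # FP64 at half the FP32 rate, FP16 at double
    gpu_fp32 = peak_tflops(3840, 2, 1.725)
    print(f"CPU FP64 ~{cpu_fp64:.1f} TF, GPU FP64 ~{gpu_fp32 / 2:.1f} TF, GPU FP16 ~{gpu_fp32 * 2:.1f} TF")
    # prints roughly 2.0, 6.6 and 26.5 TF, in line with the numbers on these slides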
The difference is that they need to be using vector-like operations. So you have, essentially,
a CPU which does things like running operating
systems, very important, and serial,
out-of-order processing. And then you have a
GPU, which is very good for vector processing. For sciences, well, many of
our algorithms are vector like. But serial components–
excuse me– come into it. And we’ll talk more
about that in a second. So then the other
thing that’s kind of exciting about this technology,
and I think it is worth noting. We have a GPU. We have a CPU. They have memory. I mentioned that. I didn’t show it, but the MI50
has 32 gigabytes of memory. It’s actually not that much
compared to a CPU today. So you have to think carefully
about where the memory goes. One thing that’s
very exciting, this is happening in
the supercomputing. And I suspect it will
trickle down to consumers. It's all connected coherently. Has anyone taken a course that covered coherency– does anyone know what coherency is in an operating system sense? Coherency is this: when you run with a cache, and you have caches in your CPU, an L1 and an L2 cache, if a number changes in that cache, what happens to the main memory? Does it know that the value has been invalidated? Well, the cache actually emits a signal to say that line is invalidated and the memory needs to be updated. So the change moves through
the cache hierarchy. You don’t have to,
as a programmer, worry about what
data you’ve changed. This system– this
is all public– will have coherency between
the CPUs and the GPU. So we have a processing element. On the GPU, it makes a change. The CPU will know about
it and vice versa. And the reason this is important to tell you, even if you don't care about the hardware details, is that you can start thinking about this as a holistic system. You can program to it. You say, look, I just
have a vector calculation. I want you to deal with that. And then you say,
well, you don’t have to move it back and forth. You just say, well, if
I need it on the CPU because I’m doing some data
processing or something, it will deal with that, too. So programmer
productivity is something that is actually important at
the system architectural level. So what I am
personally involved in is the application readiness
for these computers. And some of the scale I’ve
been trying to show you is to say that this is going
to be a difficult challenge. The labs are very much
familiar with this. They have, in fact,
what they call CAAR, the Center for Accelerated
Application Readiness. And this is at the OLCF, the Oak Ridge Leadership Computing Facility, a group at Oak Ridge whose job is to take
applications today and prepare them for this computer. The scale, all the new
architectures, things like that are things that they expect
they will need to do. Running these
computers at scale can cost millions of dollars in
electricity alone per year. So they know that every day
that the computer sits idle is a substantial expense
that they try to avoid. And by doing that, they prepare
apps to run from day one. And that’s where
they pull us in. One thing that’s very
interesting to note is you see the applications. Their goal is to have the
applications run with a figure of merit, as they call it. They choose some problem
that’s reasonable, and they want it to
be a speedup of 4x. And I’ve already told you the
machine is seven times bigger. And so they’ve already moved
the goalpost down, saying, we’re not going to
get the efficiencies that we would like. We’re not getting all
the compute out of it. And this challenge is only
going to get worse, right? You hear all these numbers
in machine learning about 100x speedups and such. Those are often very
specific numbers. Look at the whole application,
and those speedups are much more modest. And I have a very nice example
of that in a moment, I think. They selected these apps I’ll
show you in a moment just based on the science but also
on various other things– the implementation, the models,
the algorithms underneath it all. They want to have a
representative space of applications, because
they care very much that this be not a one-trick pony
but rather representative of, overall, the entire
computational modeling space. They want development plans. They want people who have
well-scoped plans and things like that. So they’re very
careful about it. But here are the eight
CAAR applications. And you see that the
domains range a wide variety of scientific interest
from plasma physics, nuclear physics, fluid dynamics,
astrophysics, things like that, and a variety of
academic partners that we’re working
closely with here. I will mention that
these applications are open proposals. So be on the lookout. There’ll be more. If you have scientific
or machine learning codes that you would like
to propose for Frontier, you’ll have my email after this. And let’s talk about it. There’ll be opportunities
to get on this machine. And it is unprecedented size. And of course, no
talk would be complete without pretty pictures. So here are some of
the examples of these. So Cholla, the astrophysical
simulation there, does a hydrodynamic simulation
of galactic modeling. You see that in the middle
picture there, essentially a galaxy moving, some of the
mergers that you might see, things like that. Cool physics. The bottom right
picture is a picture that I’m very fond of,
because I worked on this code when I was a graduate student. It’s a 3D homogeneous isotropic
turbulence simulation. This is more of a fundamental
physics investigation into, how does turbulence work? A lot of the models
that they use for things like
designing the Boeing 787 are imperfect models. They are not correct. We make approximations. This simulation is resolving all
the features of the turbulence so that we can inform
the models better. So that’s kind of the
way to think of it. But I do want to spend
a little time here. These are the apps, and these
are a little more of what they’re actually running. Give you a sense of what they
use in scientific computation. Some of you in this
room, maybe most of you, will not know what F90 is. But it’s Fortran 90. That is a standard
set in the 1990s, before, I suspect, some of you in this room were born. There's even a very modern
Fortran, or Fortran 2008. So there’s some
legacy stuff here. But there’s also C++. That’s really starting
to win out, luckily, in scientific computing. But no Python, things like that. No Julia. Yeah. Not yet. Julia is exciting, but not yet. And typically, these codes,
if you're playing along at home, are C++17. We'll see if C++20
comes in a little. Wide variety of algorithms,
too, for those of you who are computational
science people. There’s finite
volume hydrodynamics. There’s spectral methods. There’s a PIC, Particle-In-Cell. Wide variety of algorithms that
are represented here, as well. All of this is distributed with MPI, if you're familiar with it. It's the Message Passing Interface. This is a common thing in scientific computing for distributed computing. It's also used in machine learning a fair bit. TensorFlow does not use it. Some of the other frameworks, like PyTorch, do leverage it for multi-node distributed computing, parameter servers, and such.
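Since MPI keeps coming up, here is a minimal sketch of what that distributed pattern looks like, using the mpi4py bindings (assumed installed; you would launch it with something like mpirun -n 4 python script.py):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # each rank owns its own slice of the problem and does local work...
    local = np.arange(rank * 1_000_000, (rank + 1) * 1_000_000, dtype=np.float64)
    partial = local.sum()

    # ...and the partial results are combined across all ranks on the machine
    total = comm.allreduce(partial, op=MPI.SUM)
    if rank == 0:
        print(f"{size} ranks, global sum = {total:.3e}")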
kind of pushing hard on. But there are many more apps. I don’t want to give the
impression that they’re only going to run eight apps on it. Here’s just 12 others
that I’m engaged with. Again, computation. Combustion codes,
high-energy physics, cosmology, lots of fun
topics in that regard. So what I want to do is
dive into one of the apps. This is NAMD. So I showed you the
picture of Corona, the coronavirus
modeling earlier on. NAMD is a very common
molecular dynamics code. So it models the interactions
between molecules. It can time evolve these. D. E. Shaw, for
instance, is a company that does this entirely
as their business model for their scientific arm. This is a representative workload. It's not the coronavirus
one, actually, but it’s one I had previously. So what I show you here are–
you see there’s three colors. They’re coming through
relatively well. This is just a naive
examination of the CPU running a NAMD kernel of a
relatively common molecule. And you see the vast
majority of time– this is a percentage plot– 90% of the time is
roughly spent in what they call the compute kernel. They’re non-bonded
force interactions, if you’re familiar. It’s evaluating the potential
functions between atoms. And then the rest, well,
there’s some host computation, so some bonded forces,
things like that. And then there’s some
data transfer type stuff, long-range
electrostatics and such. So if you are a developer and
you start working on this, 90% of your time is being
spent in one kernel. You do what Donald Knuth says. You say, that's the thing I've got to optimize. So let's hit it hard. This is on a CPU. So just porting that kernel, it's actually very amenable; the non-bonded forces map nicely to vector operations. Very easy to port to a GPU. You get an 8x speedup, right? The mythical 10x GPU speedup. And so what happens? Well, the bottom now is the
plot of the wall clock time. It’s normalized again. And so what happened? Well, the compute kernel
is still 50% of the time. You got an 8x speedup. And what’s happened
is the other pieces are now suddenly intruding
into what you’ve done. This is Amdahl’s law in action. What happens is we’ve
had so much success with our compute
in the main kernel that the other pieces of
computation are now suddenly the bottlenecks or at
least 50% of the runtime. This is a common problem. And it's only getting worse, because our compute is accelerating so much. We're getting so good at computing that all the other things are starting to pop out and really hit us.
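You can reproduce this effect with Amdahl's law in a couple of lines. Using the rough figures from the talk, about 90% of the time in one kernel and an 8x speedup on that kernel:

    def amdahl(kernel_fraction, kernel_speedup):
        serial = 1.0 - kernel_fraction
        return 1.0 / (serial + kernel_fraction / kernel_speedup)

    f, s = 0.90, 8.0
    overall = amdahl(f, s)
    kernel_share_after = (f / s) * overall       # kernel's share of the *new* runtime
    print(f"overall ~{overall:.1f}x, kernel now ~{kernel_share_after:.0%} of runtime")
    # roughly 4.7x overall, with the kernel down to about half the wall clock

That is exactly the picture on the bottom plot: the kernel is still there, but everything else now shows up.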
If you blow this up, the last timeline I showed, how can you decompose it? Well, there's some
challenges here. There’s the GPU compute. That’s still the blue part. And that is GPU compute limited. Well, we can hit that hard. We’ll keep just making
our GPUs faster. Great. So more high-performance
double precision operations. What do you do with
the other pieces? Well, some of it is 3D FFTs. If you've studied 3D FFTs, these typically involve transposes. So they involve network communication. And they're also then decomposed into 1D FFTs, which are, if you know big-O notation, an n log n operation. That's not that much compute. The non-bonded force kernels are much more like a GEMM, a matrix multiply, a big-O n-cubed operation. Lots of operations going on. Very easy to keep all the compute units happy. n log n, not that much work. It's just a little over linear, right? So when you do this, suddenly, memory bandwidth plays a role. So now you've got a different sort of bottleneck in your system.
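A quick way to see why the FFT piece goes memory bound while the force kernel stays compute bound is to compare their arithmetic intensity, flops per byte moved. The sizes and cost models below are illustrative only:

    import math

    n = 4096
    gemm_flops = 2 * n ** 3                 # n x n matrix multiply
    gemm_bytes = 3 * n * n * 8              # read A and B, write C once (ideal reuse, FP64)
    fft_flops = 5 * n * math.log2(n)        # classic ~5 n log2 n estimate for a 1D FFT
    fft_bytes = 2 * n * 8                   # read and write the signal once

    print(f"GEMM: ~{gemm_flops / gemm_bytes:.0f} flops/byte")
    print(f"FFT : ~{fft_flops / fft_bytes:.1f} flops/byte")
    # hundreds of flops per byte for the GEMM-like kernel, single digits for the FFT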
I also mentioned network communication. So now you need to deal with
the network speeds, which, I mentioned earlier, are only
going up by a factor of four versus a factor of seven. So this bottleneck
isn't getting better. Finally, you still have the red. That's stuff
running on the host. Well, you’ve got all
sorts of other things that might go on there. Some of it involves
host transfers, copying back and
forth between devices, which I mentioned we’re
trying to get better at with coherency and such. But nevertheless, this
is still something that’s creeping in more and more. So suddenly, you’re going to
be spending no time on compute. And all you're going to be
doing is transferring messages back and forth between devices. Not very efficient, right? So that’s just one aspect
of these system workloads we’re looking at, kind of
thinking about the system balance. The next piece of
the talk, actually, is a little bit of a transition. I wanted to talk
about the convergence of scientific computing in AI. So if you remember
the coronavirus thing I showed earlier, this
was using machine learning for an effective sampling
of a subspace, couples machine learning with a
molecular dynamic simulation. Some very interesting
results in the field here. CosmoGAN, if you
haven’t seen it, is using generative modeling
to essentially produce synthetic data for
cosmological maps. Pretty cool result. That’s
from Lawrence Berkeley Lab. There have also been results in Eulerian flows, surrogate modeling, and such. What I want to show you
is some of my recent work, partly as an advertisement. This is done with
Octavi Obiols– excuse me– Sales. He’s Spanish. And he was a co-op who
worked at AMD for a summer. So it’s part of the
talk where I can say we have co-op
positions for those of you looking for summer work. And what we did
here is we sought to do an end-to-end modeling
of the Navier-Stokes equation. So I told you earlier that some
of these physical simulations are expensive. So the thought is
instead of solving a coupled set of
partial differential equations, three-dimensional
equations, Navier-Stokes equations, non-linear with a
turbulence model, the SA model, we could instead just pop
in a machine learning model. Train it on actual CFD
data and try and use it. Presumably, it’d be quicker. And use it to predict
turbulent flows. I mentioned the Boeing
example earlier. That’s kind of a good
representation in your head. Many cases, you don’t care
too much about accuracy. You care about just getting
roughly the right answer, within 20% or so. Doing that many,
many times, let’s say hundreds of thousands,
millions of times. So in this case, the
thought is, could you use this as a
surrogate model instead of solving an expensive PDE? And could you use that
for effective things like subspace sampling,
parameter searches, things like that? So this kind of lays that out,
precisely how we do it here. But on the left, you have
an initial condition. There’s no fluid
flow simulation. There’s a nice picture of
an aerofoil in the middle. And then on the right,
you have a picture of the fully resolved,
final steady state solution of the flow. This is just a pressure field,
but it’s not too important. So normally, you just
do a physics solver. It might be an iterative solver. It takes n iterations. Just solves the physics. On the bottom, what we
propose is actually– and this is something that,
as is typical of these talks, you show a final state
versus what you started with. But we do the physics
solver for a few iterations. Let’s call it k iterations. This is called a warm-up. You then use the
steep learning model that has been trained
in an inference mode. It takes the input image. It spits out what it
thinks the final is. And then you use
the physics solver for a few more iterations, kind
of a refinement at the end. The reason for this
that you can say, ah, a posteriori reasoning,
is that the warm-up, well, when we give it
the input conditions, it doesn’t really know what
those boundary conditions are. They’re not codified at all. You could embed those in
the machine learning model. I’ll talk about the
architecture on the next slide. But there’s no reason to do so. In fact, we found
that unsatisfactory. What we want here
is a model that we can plug into any
physics package, no knowledge of the
physics required. So we give it a few iterations. It gives it a sense of
that boundary layer. It’s starting to
resolve the flow. It sees where, for
instance, the velocity is 0, where the
incompressibility should be 0, things like that. So then you give it
to the inference. It spits out something. But it still doesn’t
actually satisfy all of the physical laws
that we know to be true. For instance, the velocity at
the wall might not be quite 0. It might be very small. So the physics model in
the final few iterations really imposes those
physical constraints. And when we do
this, as I’ll show, we get substantial speedups. So this is a way you can kind
of push your neural net button, and you will speed
up your simulation. It’ll be quicker. And yet it actually fits all
of the convergence criteria of the physics
simulation to begin with. Still satisfies
incompressibility. It still satisfies, let’s
say, a 10 to the negative 6 convergence criteria, all that. So if you were, I claim, to
have a black box between these and you were to run
your physics simulation, you wouldn’t know
which one I had run. You wouldn’t be able to tell. So no talk on
machine learning is complete without a picture
of convolutional layers. This is the network
architecture. So it’s relatively shallow, in
the parlance of deep learning. It’s three convolutional
layers coupled with what I hate to call
deconvolutional layers, because they’re not
deconvolutional layers. This is the part of
the talk where I rant. They are transposes
of the weight matrix. That’s not a deconvolution. “De” would be the
inverse, right? If you’ve taken linear
algebra, you should know that. So don’t do it. It’s not in the paper. But for brevity, I put it here. So there’s three layers here. You start on the left. There’s the initial condition. These are fed in as,
essentially, images. A velocity field can be
thought of as just an image in this sense. You move it through three
convolutional layers. You see that we use kind of
standard activation functions, like hyperbolic
tangents and such, here. And at this point, you’ve
reached the middle, the yellow here, which is the projection
operator to, essentially, a latent space. So you have some
latent representation of the physics that’s
being represented. A classifier, this might
just be a 1, where you say, yes, I think it’s a cat. In this case, we believe
it corresponds roughly to features of the flow. And I hate to use the
word “learning” here, but the network has
learned some features of flows that tend to be
characteristic– for instance, that near walls, the velocity
should be 0, et cetera. But this is very
closely related, and this is why I
mention it, to generative methods, auto
encoders, and such that are very popular right now. So then you follow
from the latent space to projection back
to the forward in time, which goes
through what we would call these deconvolution operators. And it spits out what
it thinks the velocity field should look like. By the way, this
is all the field. So I should say, in addition
to the three components of velocity, it also spits
out the pressure field, energy fields, things like that. So it's a complete end-to-end package. This neural network architecture is used for all of the fields. It's consistent.
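For concreteness, here is a rough PyTorch sketch of that shape: three convolutions down to a latent representation and three transposed convolutions back out. The channel counts and kernel sizes are illustrative guesses, not the published configuration:

    import torch
    import torch.nn as nn

    class FlowSurrogate(nn.Module):
        def __init__(self, fields=4):                     # e.g. velocity components + pressure
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(fields, 32, 5, stride=2, padding=2), nn.Tanh(),
                nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.Tanh(),
                nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.Tanh(),
            )
            # "deconvolutions" are really transposed convolutions, as ranted above
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 5, stride=2, padding=2, output_padding=1), nn.Tanh(),
                nn.ConvTranspose2d(64, 32, 5, stride=2, padding=2, output_padding=1), nn.Tanh(),
                nn.ConvTranspose2d(32, fields, 5, stride=2, padding=2, output_padding=1),
            )

        def forward(self, x):                             # x: (batch, fields, H, W)
            return self.decoder(self.encoder(x))          # predicted near-converged field

    print(FlowSurrogate()(torch.randn(1, 4, 64, 64)).shape)   # torch.Size([1, 4, 64, 64])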
So again, it has no knowledge of the physics. [INAUDIBLE] I should be fair. So the question is
for transient flow. It’s a great question. We haven’t done it yet. So this is all steady. That’s future work. I think that there’s a variety
of ways you could look at it. Maybe we should
take that offline. But that’s something I’m
very eager to look at. It’s a good question,
but this is steady. The training data. So on the top, I have what
are a series of ellipses and then the testing data,
which is an aerofoil. It has not seen this. It has not been trained on it. It's been trained on things
similar to an aerofoil in geometry but
not quite the same. So this is a mildly
extrapolative state. It’s extrapolating a little
to something it’s never seen. In particular, you
see I highlighted in red the trailing edge of the
aerofoil, which is certainly not elliptic. It has important
repercussions if you’re familiar with fluid dynamics,
such as separation, pressures, and such. And then there’s an
ellipse, which it has seen. It has not seen this particular
eccentricity of ellipse, so it is an interpolation
between two it has seen. And then finally a cylinder. So what we’re doing
here is trying to test it on conditions
it hasn’t seen before. So this gets back to the
question at the beginning. How do we have confidence
in our predictions? Well, we can look at it
in different regimes. Regimes in which it’s
been extrapolated, it’s going to do the worst
on those, because everyone does worse when you extrapolate. Cases where it’s seen it
already, well, it nails those. And then cases where
it interpolates– it hasn’t quite seen that case,
but it’s seen cases around it. And the results, it will come as
very little surprise, are good. So this is the velocity
field for the channel flow. So in this case, on the top and
the bottom, it’s like a pipe. There are walls. And then the flow. So a is if we do this
with no refinement. So we actually just use the
model, see what it spits out. b is where we use it with
this final refinement, where it actually reuses the
physics solver at the end. And then c is the
physics solver itself. By eyeball norm, you’re not
going to see a difference here. d is the difference between them. And you see that the absolute
error is, in many cases, substantially less
than a percent. Yeah. So a couple of slides
back, you showed the– on the top, you have
the physical solver. In between, you put
the neural network. Yes. It’s not clear to me
how you train that. [INAUDIBLE] and the
architecture you have later are not the same, right? That’s right. And yeah, all we did is we
trained it on the final states. And I should have actually
added to that slide. So we give it the– we sample. So we have an iterative solver. And I’ll show you. It takes a few thousand
iterations just to converge. Let’s save those iterations
before convergence. And then we'll give it the final state. The final state is the truth. That's what you're trying to minimize against. And then the input data, the training data itself, is those iterations leading up to it.
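As a sketch, that training setup is just a standard supervised regression against the converged field. This assumes the hypothetical FlowSurrogate-style network sketched earlier and hypothetical data pairs, not the actual training code:

    import torch
    import torch.nn as nn

    def train(model, pairs, epochs=10, lr=1e-4):
        # pairs: iterable of (intermediate_field, converged_field) tensors,
        # each shaped (batch, fields, H, W)
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.MSELoss()
        for _ in range(epochs):
            for intermediate, converged in pairs:
                opt.zero_grad()
                loss = loss_fn(model(intermediate), converged)   # target is the converged state
                loss.backward()
                opt.step()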
Yeah, so it is different in that sense. The pressure field, very
important if you follow this. Again, this is
slightly interesting that you see it doesn’t nail it. It’s slightly different. I don’t have a dif on this. But this is an area
that’s harder to get. It is a harder condition
to impose in these solvers. If you’re familiar
with CFD, you know that pressure is the hard part. It doesn’t nail it. It does pretty well. So it’s interesting. It learns the velocity
field exceptionally well. It doesn't learn or calibrate the pressure quite as well. There are lots of reasons it might or might not work. Maybe it just needs more
capacity, more layers. That’s the typical
trend in deep learning is to just throw
more layers at it. I find that unsatisfactory. But nevertheless, this is
where it’s slightly weak. So accelerating convergence. So here are the plots
of some of the results. This is the case we called, and the name isn't particularly illuminating, observed geometry, different flow conditions. So this is where we said this
is essentially a verification test. We test it on cases it’s
already seen before. It’s been trained for it. How well does it
actually get it? And how much does
it speed it up? So not unexpectedly, you see
that it gets about a 7 times speedup. So it goes from about
1,500 iterations in the CFD to only requiring
fewer than 300, so a big speedup. And again, it's meeting all
the convergence criteria that you had before. You could imagine you’ve
run a code before. Your network is trained on it. You want to run it again for
a slightly different case. The flow conditions, the
Reynolds number has changed, minor changes like that. And suddenly, you
say, you could just use your neural network for it. It would be a speedup, right? This is not an
unprecedented mode to run in for engineering
problems of [? interest. ?] So you have a database of
runs you’ve already done. That would be a useful case. This is a harder case. Now this is a subset of the
velocity– or the geometry, excuse me, the same
flow conditions. So we don’t change the flows. But now we change the
geometry it’s been trained on. It’s been trained
on one geometry. It has to run on another. This is the interpolation
case I showed you before. The speedup is more modest. What that means is
that, well, you’re not giving it stuff it’s
already seen precisely before. It’s now guessing a little more. It doesn’t perfectly
represent the results. It certainly doesn’t
perfectly represent the Navier-Stokes operator,
the underlying mathematical operator. But nevertheless, the
speedup is still 3.5x or so. So it’s a not
insubstantial speedup. And again, because
of the way we do it where we have these final
iterations on the solver, it still satisfies all
the physical constraints that it needs. It’s just the speedup
is more modest. And then finally, the one
that you all came for, the extrapolation cases. You’re still, in many cases,
getting 50% to 2x speedups on both aerofoils
and the cylinders. Those are the cases
it’s never seen before. They’re still modest
extrapolations. I don't want to oversell this. These aren't wildly different
geometries than it’s seen. But it is maintaining
everything it did before and yet giving appreciable
speedup for it. So again, from your
perspective, you could be running a CFD code. A Boeing engineer could be
doing optimization on this. They would probably have
confidence running this, and a 3x speedup might
be enough for them to say this is a big win to them. So I think this is a
very exciting result. It’s very in keeping
with what’s going on in machine learning and AI
when coupled with CFD here. But scientific
computation in general, you don’t expect these
things, as I joked before, to just replace your jobs. That’s not what the purpose is. But as using it as
a surrogate model, replacing computationally
expensive physics, using it in very specific ways,
that’s a great place for it. And so I see this,
again, as a trend that you should be aware of. By no means do I
think that we will not know what PDEs or
physics is in the future. But I do see this as a method
of supplementing scientific computation, particularly
for optimization and such. And as someone mentioned, I think there's a variety
of ways this could go. I’m very excited
in continuing it. Unsteady flows, looking at other
turbulence models and regimes, I think that’s actually
of less interest. I don’t see any
particular reason the Spalart-Allmaras model was
better or worse than others. And then the other area that
I’m particularly interested in is physical demands. We’ve done it for CFD. That’s because of my background. But looking at, for
instance, molecular dynamics simulations, can we use
this general architecture for many different cases? And in transfer learning, which
is common in machine learning, you have trained weights that
have learned certain features. And then it accelerates
the training of a network when you
apply it to a new domain. There’s reason to be optimistic
maybe that would work here, too. So you might have a general
network architecture, and you could pull
it down and train it for your own particular physics. And it might take only a
little time to train it. That would be very exciting. The final area I'll
mention is there’s no reason you
couldn’t apply this to more abstract
mathematical machinery. That's something I'd like to
look at, thinking about it. Iterative solvers,
multi-grid, in principle, very similar to the process
we’re looking at here. So I only have 10 minutes. So I will just throw
up my conclusions. I tried to take you on a
talk through what we’re doing with the supercomputing
but then try and tie it back to some of the applications
and domains and things that I think will hit scientific
computing in the next decade, let’s say. And some of those include these
heterogeneous architectures, a lot of the different
programmability. It's going to get very hard. Those free wins are
not going to appear anymore. You’re not going to
just buy a new CPU. But this, in turn,
will be an opportunity. There’ll be new
algorithms, new approaches that you can look at there. Lots of good papers
to publish there. And it will actually
involve focusing on how you couple with the hardware. And that’s, obviously, to come
full circle, my role at AMD as a computational scientist. I work with a lot of
hardware architects who do not know what
their hardware is used on. And tightening that loop
is a very important thing. And I think data scientists will
be looking at this, as well. You won’t just
build a data center. You’ll build a data center
for a specific purpose. And similarly, we’re doing that
with the supercomputers today. So think about that
kind of architecture as you go off in your
careers that way. So I’ll end with that. I’m happy to take questions. I have my email if you
think of something later, and you had a good retort. And then I do have
Frontier stickers if you want one for your laptop. So you can go. Thank you. [APPLAUSE] Thank you, Nick,
for the great talk. I do have a bunch
of questions here. Go ahead. From my own back of the
envelope calculations, can you tell us how
many [? builds ?] will be [INAUDIBLE] Frontier? No, I can’t. No comment is what
I’m supposed to say. Sorry. So you mentioned the trend
of integrating ML with the high-performance scientific
computing algorithms. Do you see
any new challenges to hardware? What are the modifications? And can you give
some examples that motivate new hardware designs
because of that integration? To take it even
one step further– and I didn’t put
it in this talk, because it’s become a cliche. But Moore’s law is ending. We are not getting
the free wins on transistors we used to. We are specially designing our hardware. And yet everything I said
here was to say, let’s be more general purpose
about our algorithms and our approaches. So those are two
completely opposing forces. I think it is clear that,
for instance, our competitor, Nvidia, is developing
devices that have tensor cores on them,
which are specialized machine learning hardware. There have been many attempts
to leverage that for HPC. And it hasn’t worked
so well in my opinion. So I think the answer is
still general purpose. We have to architect
carefully our designs so that we will reach
the broadest base. I think the labs do a
good job of covering the physical regimes. But internally at
AMD, what we look at is a much broader basket. It’s things like that,
but it’s also ResNet-50. It’s also Transformer. Try and capture, what are
the computational motifs that we can really leverage? Do you know why tensor-specific
design may not work well? Well, if you look at,
for instance, the Google TPU, which you may have heard
of, that is a systolic design. It’s much more like
an actual matrix operation versus a vector. There are pros and cons to this. The problems that I
encounter in the NAMD example will be pervasive in the TPU. It will work very well for
convolutional operators of a certain size. And it will work
very badly in a week when someone decides
convolutional operators are useless. And at the rate
machine learning is moving, in particular, but
even scientific computation, I would be very hesitant
to inscribe in hardware, which takes many years to do and
design, something that focused. I think general purpose but
factorizable is the right way to go, personally. Yeah. How are you measuring
the megaflops? Is it using LINPACK or– Yeah, that’s a good question. I wanted to talk,
and I just didn’t have time, so a great question. When I showed the
pictures here of how these computers are actually
measured, I did show the peak. I did not use LINPACK. So the question
is LINPACK is a– sorry, let me see if
I can find the slide. LINPACK is essentially doing
matrix multiplications. It’s right here. One more. So these numbers are
the peak numbers. They are the rated
capacity of the device. No application will ever
hit that, let’s be clear. You will get some
efficiency out of it. And that efficiency is,
let’s say, 73% on Summit if you do the math. So they’re only getting
73% of this number here. Similarly, I show you
this big, old number here. We’re not going to hit that
number on any real application. But if you were to, say, hit 73% on HPL, or High-Performance LINPACK, that would be a reasonable rough estimate.
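As a worked example with public figures (Summit's list entry is roughly 148.6 measured petaflops against an approximately 200 petaflop peak; treat these as approximate numbers):

    summit_peak_pf = 200.0
    summit_hpl_pf = 148.6
    efficiency = summit_hpl_pf / summit_peak_pf
    print(f"Summit HPL efficiency ~{efficiency:.0%}")        # about 74%, the ~73% quoted above

    frontier_peak_ef = 1.5
    print(f"At a similar fraction, Frontier's HPL would be ~{frontier_peak_ef * efficiency:.2f} exaflops")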
So for those of you who don't know, the top 500 list does
not just take your word. If someone says my device
runs at 10 exaflops, they don’t believe you. They have to measure it. And we measure this empirically
with a standard data set or standard problem,
which is called high-performance LINPACK,
which largely just does matrix multiplies of
a very large nature. It makes it pretty
compute bound. So it tends to be a very
nice and flattering portrait of the computer
versus the reality. Most apps will be
lower than HPL. Yeah, I have an
architectural question. So Nvidia acquired
Mellanox, right? Yes. And you can now DMA directly into the global memory of the GPU by going over the network. But I think it's
only PCI-e 3, right? Because it’s Intel limited. That’s right. And then in their unified
memory architecture, you can go back and forth. But underneath the hood, it’s
still going over PCI-e 3, right? So you guys have PCI-e 4
in a different topology. So what’s going
on under the hood? And what’s your sort
of interconnect DMA sort of ecosystem? I’m so glad you asked. Good question. We have what we call
Infinity Fabric. This will be PCI-e
Gen 5 compatible, but it will be substantially
higher bandwidth. It’s internal to AMD. And the other key thing
here is that it will connect between both CPUs and GPUs. So it’ll allow for a coherent
memory space between both CPUs and GPUs. Yeah. But between the HPL and
the GPU and the regular EDR and the CPU, is it still
going over the PCI-e bus? Or do you have a
separate path for that? Yeah. And so Nvidia has
not done us a service with this unified
memory statement. Because to me,
unified memory means it’s a global address space. That’s what it should be. That’s not what it
means right now. There are a variety of ways that
it occurs in Nvidia hardware today. Those can be explicit copies
over the PCI-e bus, where it just migrates the
whole data structure. So I’ve got a data
structure on my CPU. And the GPU says, I’d
really like to use it. And it just copies the
whole thing over PCI-e. As I think you’re implying,
that’s very inefficient. The coherency I’m talking
about here is different. It’s much more fine-grained. It’s much more like a
C++ coherency model, if you’re familiar with that. What is happening here is that
it will require a zero copy. So again, like a cache
line, you can just pass an invalidation,
which will say, this cache line has
been invalidated in this data structure. And then it will pass, let’s
say, a cache line to it. So it’s a much
smaller message size. It’s not the full
buffer you need to move. That’s how it works in caches,
because that’s more efficient. Now, if you want,
it’s like they say. If you like your doctor,
you can keep your doctor. If you want your migration,
you can keep your migration. There will still be ability
to explicitly migrate data. And then there’ll also be this
cudaMemAdvise-style hint, if you're
familiar with it. There’ll be similar
things to that. So it’ll still move them if– the thought being here,
if you once in a while just ping the CPU and say, I need
this updated data structure, that’s great. If you’re doing it
constantly, you’re going to see the PCI-e bus. You’re going to see
whatever your connection is no matter what. There’s no way you can
architect yourself out of that. At that point, the
software should say, this guy keeps asking for it. This person keeps asking for it. Let’s just move the
whole data over. And that’s where
it should reside. But how you choose that– is it five times? Is 1,000 times? That’s a little bit of more
of a clever tuning parameter. One more question, I think. Yeah, go ahead. I wondered, what was
the FLOPS we could achieve with the classic CFD solver and with the CFD net? My impression is that those
classical numerical methods, like finite volume,
finite difference, they have pretty low
compute-to-memory ratio. So the FLOPS are typically
much lower than with the neural net. I'm
wondering, do you achieve a much higher
machine utilization by adding machine
learning into [INAUDIBLE]?? That’s a great point. And that’s part
of the motivation here is that if
everyone got that– most CFD solvers, just in
general, are memory bound. They don’t have
very big stencils. They have very few
FLOPS per load, if you think of it a
memory system ratio. Deep learning has been
carefully constructed in part to fully utilize the device,
very much use the ALUs. The number of FLOPS per
memory load are very high. So methods like the
machine learning one are, in some sense,
computationally more efficient. So I don’t know if that
answers your question, but I agree with the
spirit of it, I thought, or at least the gist I got. [INAUDIBLE] What percentage of peak FLOPS can you achieve with the original model? With the model I
showed, it achieves well over 90% utilization
of the device. If you were to do
the same thing, I have not measured
it for the CFD. But it would look like
the memory system. So yeah, it’s doing a
lot more as a result. I think that’s a powerful
part of the technique. Thank you very much, Nick. Let’s thank him for coming. Thank you. [APPLAUSE]

