At the HPC User Forum in Edinburgh, Scotland, we had the opportunity to catch up on the NEXTGenIO results with three people from the project that has just finished: Tiago Quintino, Michele Weiland and Adrian Jackson.
Michele, can you tell us a little bit about the project achievements?
Michele Weiland: The project has just finished, as you said, at the end of September, after four long years of collaboration with many partners, such as Tiago from the European Centre for Medium-Range Weather Forecasts (ECMWF). Other partners are Fujitsu, Intel, Technical University Dresden, Barcelona Supercomputing Center, and so on. We have all worked together to deliver a prototype built on Intel's Optane DC Persistent Memory. It includes both the hardware, the prototype we host in Edinburgh, and a full software stack delivered by some of the partners.
What were the project goals?
Michele Weiland: The project goals were to remove the I/O bottleneck as much as possible from HPC simulations, and not just traditional HPC simulations but also the upcoming data-intensive and data analytics types of applications. The aim was to use this new memory technology to close the performance gap that you have between DRAM and storage by putting a layer in between.
Is it faster than solid state disks?
Michele Weiland: It is much faster. If you look at it from the other end, it is about ten times slower than DRAM but one hundred times faster than solid state discs.
What are the typical applications for which you could use this solution?
Michele Weiland: You can use this memory in two ways. You can either use it as very fast storage or as slightly slower but very large memory. It depends on what the use case is. If you have a problem that doesn't fit into the traditional DRAM you have on your system, this is a good way of being able to run on a smaller node count with a very large memory. Or you can have cases like the OpenFOAM application, which writes lots and lots of files, which will kill performance on a parallel file system. Then you can use this memory as an intermediate step: write onto this memory and, later on, copy your data off. The performance improvement is much greater because you are not limiting the performance by the slower parallel file system.
This sounds all very computational but if you use it in real applications, what happens then?
Tiago Quintino: In the ECMWF case, we took the part of our workflow that suffers most today from I/O performance issues. We ported it to the system. We also changed the application, and the I/O stack below it, to take advantage of the devices. What we have been able to demonstrate is that the bottlenecks we see in our systems today are gone. I/O, at least in this system, with the configuration that we tested, is no longer a bottleneck.
The application is from the European Centre for Medium-Range Weather Forecasts?
Tiago Quintino: Yes, we have run some weather forecasting simulations that write the data, and at the same time we have the consumers of that data, who read it as the model output is being written. Usually, this produces a lot of contention on the I/O system. These memory devices do not feel that contention.
The users that you have are the users from the European Centre for Medium-Range Weather Forecasts, so these are people who are doing the forecasts or using the applications for the forecasts?
Tiago Quintino: Yes, the usual workflow is to prepare forecasts and send them to all our Member State countries across Europe.
Of course, it is about getting data fast in and out of the computer and of the memory. What size of data are we talking about?
Tiago Quintino: Today, if we were to use the system, we would put about 20 terabytes of data in, and immediately out again, within a one-hour slot. This is like rolling access to the data throughout the whole forecast, which is a 10-day forecast.
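To put the figure above in perspective, the following back-of-envelope sketch converts "20 terabytes within one hour" into a sustained transfer rate. It assumes decimal units (1 TB = 10^12 bytes) and treats "in and immediately out" as roughly one full write plus one full read; both are assumptions for illustration, not figures from the interview.

```python
# Back-of-envelope estimate. Decimal units are an assumption.
TB = 10**12
GB = 10**9


def sustained_rate_gb_per_s(data_bytes: int, window_seconds: float) -> float:
    """Average rate (GB/s) needed to move data_bytes within window_seconds."""
    return data_bytes / window_seconds / GB


# 20 TB written within a one-hour slot.
write_rate = sustained_rate_gb_per_s(20 * TB, 3600)

# Assumption: "in and immediately out" means everything is written once
# and read once, so the combined traffic is about twice the write rate.
combined_rate = 2 * write_rate

print(f"write: {write_rate:.2f} GB/s, write+read: {combined_rate:.2f} GB/s")
# prints "write: 5.56 GB/s, write+read: 11.11 GB/s"
```

These are averages over the hour; real workloads are bursty, so peak demand would be higher.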
With the new system, we assume that you only get to do testing, it is not operational yet?
Tiago Quintino: Yes, it is a demonstration.
What do you expect out of it?
Tiago Quintino: What this system shows is that, if you project this performance to what an exascale forecasting system will be, it will be able to cope with the data that an exascale system will throw at it.
That means that for exascale we are ready concerning I/O?
Tiago Quintino: We need to build the systems but this certainly can be an important component of that system.
Can you tell us a little bit about the technology behind it?
Adrian Jackson: The hardware is Intel solid-state memory, non-volatile memory, which comes out of Intel and which they have been developing for a while. It is used in disk drives as well. The difference here is that this memory sits on the memory channels, integrated in the nodes. This means you can access it from your programme as if it was main memory, as if it was RAM. You can access it by individual bytes or cache lines and get the data that way, compared to what you would normally do with a disk drive, where you access data in large, 4 KB blocks. As for the raw performance, as Michele already said, it is not that much lower than main memory. On the prototype system, in a single node, we can get about 108 gigabytes/second of data transfer into main memory if you run a benchmark on it. With the Optane memory, you can do about 30 to 35 gigabytes/second. That is considerable for a single node, but it also lets you access that memory in access patterns which are good for your application but would have performed badly on the storage systems we have had in the past.
But of course, that also comes with challenges, because now we're putting the storage inside the compute nodes. At the moment we have our storage outside the compute nodes, in a parallel file system. A lot of the work that we and other partners, including the Barcelona Supercomputing Center, have done in the project has been to set up a software infrastructure to support that. How do you let users access their data? How do you let applications use it in an easier way, without having to go through all the porting work? A team like Tiago's can afford to do this for their big application. However, for a lot of the applications we run on our big systems, people won't have the time or all the skills to port them. So, can they still use this hardware, maybe not get quite as good performance as Tiago but still get something out of it? Therefore, we have done quite a lot of work on file systems and data schedulers and integration into the whole system to try and support that.
With this type of memory, if you turn off the system, will the data stay on?
Adrian Jackson: Absolutely. The whole point is that it is persistent. You have to be slightly careful when you are using it, because after you write the data you have to make sure it has actually reached the memory before you turn off the system. That is a slight change in how you program it, but apart from that, the data will stay there.
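The two points made above, byte-level access and the need to make sure data has reached the persistent medium, can be illustrated with a minimal sketch. It uses an ordinary memory-mapped temporary file as a stand-in, which is an assumption: on a real persistent-memory system the file would live on a persistent-memory-backed file system, and production code would more likely use a dedicated library such as PMDK. The path and sizes are hypothetical.

```python
import mmap
import os
import tempfile

# Hypothetical stand-in: an ordinary temporary file. On a real system this
# would be a file on a persistent-memory-backed file system.
path = os.path.join(tempfile.mkdtemp(), "pmem_demo.bin")
SIZE = 4096

# Create a zero-filled region to map.
with open(path, "wb") as f:
    f.write(b"\x00" * SIZE)

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), SIZE) as region:
        # Byte-granular update: change five bytes in place, with no need
        # to read, modify and rewrite a whole block.
        region[100:105] = b"hello"
        # Explicitly push the change to the backing store. Without this
        # step the update may still be sitting in volatile caches.
        region.flush()

# Re-open the file independently to check the bytes are really there.
with open(path, "rb") as f:
    f.seek(100)
    recovered = f.read(5)

print(recovered)  # prints b'hello'
```

The explicit `flush()` call plays the role of the "slight change" mentioned in the answer: the store itself is not enough, the program has to confirm the data has been pushed out before relying on it surviving a power-off.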
In a sense, it is more like a file system?
Adrian Jackson: It is more like a disc drive. It absolutely is.
Can you tell us a little bit about the software extensions you did?
Michele Weiland: One of the key things we worked on is an extension of SLURM. We use SLURM as the resource manager on the system. We have extended it to be aware of this new type of memory, because you can use the memory in two different modes: as either slow memory or fast storage. You switch between the two by rebooting the nodes. We have made SLURM aware of this. When you submit your job you can say: 'I want my job to run in memory mode and I want my node to have at least, say, one terabyte of memory free.' You submit your job and SLURM will either place it on a node that is already in the correct mode or, if nothing like that is available, reboot a node. That is one extension.
Another thing we have done is the following. Because the memory is persistent, you can envisage a scenario where you have a sort of producer-consumer workflow. So, you have an application that produces data and a consumer that wants to read that data. Because SLURM is aware of the non-volatile memory, you can say: 'I am producing data on this node, I want the next job to be on that node', and you just leave the data on the node. You don't have the pain of pushing the data out to the parallel file system and getting it back. These are the sorts of extensions we have made to SLURM. These are the most significant ones.
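The placement logic described in this answer can be sketched as a toy model. This is not SLURM code and does not reflect the project's actual implementation or any SLURM option names; every class, field and function below (`Node`, `Job`, `place`, the mode labels) is hypothetical and exists only to illustrate the two ideas: mode-aware placement with a reboot as fallback, and running a consumer on the node where the producer left its data.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Node:
    name: str
    mode: str                       # hypothetical labels: "memory" or "storage"
    free_tb: float
    datasets: Set[str] = field(default_factory=set)  # data left by earlier jobs


@dataclass
class Job:
    name: str
    mode: str
    min_free_tb: float = 0.0
    needs_dataset: Optional[str] = None


def place(job: Job, nodes: List[Node]) -> Tuple[Node, bool]:
    """Pick a node for the job. Returns (node, rebooted)."""
    candidates = [n for n in nodes if n.free_tb >= job.min_free_tb]
    # Data affinity: a consumer must run where its input already lives.
    if job.needs_dataset is not None:
        candidates = [n for n in candidates if job.needs_dataset in n.datasets]
    if not candidates:
        raise RuntimeError(f"no node can run {job.name}")
    # Prefer a node that is already in the requested mode.
    for node in candidates:
        if node.mode == job.mode:
            return node, False
    # Otherwise switch a node, preferring one that holds the least data.
    node = min(candidates, key=lambda n: len(n.datasets))
    node.mode = job.mode            # stands in for rebooting the node
    return node, True


nodes = [Node("n1", "storage", 3.0), Node("n2", "storage", 3.0)]

# Producer runs in storage mode and leaves its output on the node.
prod_node, _ = place(Job("producer", "storage", min_free_tb=1.0), nodes)
prod_node.datasets.add("forecast-run")

# Consumer is sent to the node that holds the data.
cons_node, cons_rebooted = place(
    Job("consumer", "storage", needs_dataset="forecast-run"), nodes
)

# A large-memory job finds no node in memory mode, so one is switched.
big_node, big_rebooted = place(Job("big-memory", "memory", min_free_tb=1.0), nodes)

print(prod_node.name, cons_node.name, big_node.name, big_rebooted)
```

In this toy run the consumer lands on the producer's node without a reboot, while the memory-mode job triggers a mode switch on the node that holds no data. A real resource manager would also account for queueing, reboot time and what happens to data when a node changes mode.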
Would it be easy to expand the system?
Michele Weiland: In terms of the number of nodes it would not be hard; it is a question of adding more racks. We currently have just under two racks, not entirely full. You can scale up as far as you can afford.
Do you think that later on the companies will take it over and make it their product?
Michele Weiland: It is already a product. Fujitsu has released it as a product in the PRIMERGY and PRIMEQUEST server model series in August 2019.
If you go back to the weather forecasting application, it is very nice that you can speed up the weather forecast, but are there other things that you can do now with this new technology?
Tiago Quintino: Yes. One of the things that is particularly interesting about this technology is not only what it can do to improve what we do today but the horizon it is going to open up and the new things that we will be able to do. For example, because of its density, the fact that you have byte addressability to a huge data pool in the order of multiple terabytes, we can now envision keeping multiple weather forecasts in memory and allowing our users to access them by cutting through the data. This is a hypercube of six dimensions - in time, in space, etc. - and they can cut it and access the data in ways that previously would have been completely non-optimal. Today, the scientists don't think like that. It is a constraint of the current workflow that they analyze the data in a certain sequence, because that is the best access pattern that discs provide today, but this system is nearly flat with respect to access across all the datasets. You will be able to move forward and backward in time, or in vertical directions. This will open up high performance data analytics that we have not thought about.
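As a rough illustration of "cutting through" a six-dimensional hypercube, the sketch below stores a tiny dataset as one flat, row-major pool and extracts a time series at a single point. The axis names, their order and the sizes are assumptions made up for the example; the interview only says the cube has six dimensions including time and space.

```python
from itertools import product
from math import prod


def strides_for(shape):
    """Row-major strides (in elements) for an N-dimensional array."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def offset(index, strides):
    """Position of one element inside the flat pool."""
    return sum(i * s for i, s in zip(index, strides))


# Hypothetical axes: forecast run, time step, variable, level, lat, lon.
# Sizes are tiny so the example runs instantly.
shape = (2, 3, 2, 2, 3, 4)
strides = strides_for(shape)

# Flat data pool; each value equals its own offset so cuts are easy to check.
data = list(range(prod(shape)))


def cut(fixed):
    """Return the values of a cut: axes in `fixed` are pinned, others span."""
    ranges = [[fixed[a]] if a in fixed else range(n) for a, n in enumerate(shape)]
    return [data[offset(idx, strides)] for idx in product(*ranges)]


# Time series at one grid point: every axis pinned except the time step.
series = cut({0: 1, 2: 0, 3: 1, 4: 2, 5: 3})
print(strides)
print(series)
```

With this layout, consecutive time steps sit `strides[1]` elements apart, so the time series is a strided read that touches one element per step. On block storage each of those elements would cost a whole block read, whereas on byte-addressable memory the cut only touches the elements it needs, which is one way to read the "nearly flat" access remark above.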
If we talk about the technology future, this is one project of course, what are the next steps that you will take in the technology, or think you will take?
Adrian Jackson: The hardware is obviously Intel's, and the memory will progress to a second generation, coming out at some point next year. The processors that support it will be moving forward too. We will start to see it come into production systems. Some of the big US systems that will be installed by 2022/2023 will have this type of memory, not in every compute node, maybe in some islands of the computer, but it is going to be there. There are lots of people working on single-node file systems and on file systems optimized for non-volatile memory. Intel has the DAOS object store, which will be pushed forward as well, offering a similar kind of system to people who cannot develop their own. We are very keen on pushing the software side of things: efficient integration of tools to move data for users, tools to keep track of where the data is, and the file system my colleagues and I have developed, which builds a whole file system at as low a cost as possible. These are all things that we are looking forward to over the next few years.
But actually, one of the really nice things is that the system has been running during the project that has just finished. We now have a nice, stable, usable system and two or three years to make good use of it. It is going to be a lot of work taking applications and optimizing them, seeing how users will use them and how the industry will interact.
You will keep the system up and running in Edinburgh and doing a lot of testing?
Michele Weiland: Yes, the system will be kept running for three years after the project. The project has finished but the system is brand new.
Sometimes in research it happens that the system ends with the project but not in this case?
Michele Weiland: No, not in this case. The system will be around for three years and people will be applying for access. People can contact me for that.
Great. Thank you very much for this interview.