At the Third eScience Symposium, recently held at the ArenA in Amsterdam, The Netherlands, we had the opportunity to talk with Tony Hey from the eScience Institute at the University of Washington in Seattle about eScience and the fourth paradigm for scientific discovery. Before that, Tony Hey led the UK's eScience programme and, between 2005 and 2015, he spent ten years with Microsoft Research, directing their external research activities and doing eScience in Microsoft with Jim Gray. Together, they wrote a book titled "The Fourth Paradigm: Data-Intensive Scientific Discovery". In 2015, Tony Hey has been on sabbatical at the University of Washington's eScience Institute. They have some very interesting activities there, so it has been a very interesting time in the States for Tony Hey.
eScience in The Netherlands is really taking off, given this third edition of the Dutch eScience Symposium with 650 participants. Has it not taken a long time to take off?
Tony Hey: Yes, I think so. In the UK, we started in 2001, which was very early. The trends were already apparent that there was going to be Big Data in almost every field and that you would need new technologies, new techniques, distributed computing, distributed ways of collaborating and so on. However, it took a long time for this to be embraced by all the fields. As an example, in 2004, I set up the Digital Curation Centre in the UK, which is very similar to DANS in The Netherlands. I gave a keynote this year at their 10th anniversary. Progress has been made: Dutch researchers now have to put in a data management plan, in the US you have to put in a data management plan, and in the UK as well. Data management plans are now becoming part of the scientists' bids for research funds.
If you look at the Horizon 2020 projects, they also now call for data plans. They are not mandatory yet but they are strongly encouraged.
Tony Hey: I think that is the responsible thing. You don't want to keep all data but you should have some way of understanding which data you need, which data is needed to support the publication. In my view, this is all connected to what I would call an open science agenda. There is the open access movement. If you want to make the results reproducible, you have to link to the data you used to get the results, and also to the software and so on. You need the publication, the data and the software all linked together and we are only at the beginning of doing that.
Each researcher more or less makes his own data, knows about it and abandons it when he goes on to something else.
Tony Hey: This is what I would call the long tail. There are these big experimental collaborations like the LHC, like the astronomers, where they are very well organized, where they have budgets for software and everything else but many scientists use things like Excel. When you look at an Excel spreadsheet five years later, you have no idea what you put in the columns, what the calculations and annotations are. One of the things I did when I was at Microsoft Research was an open source addition to Excel, so you could annotate an Excel spreadsheet by telling what is in there and what you have done with it. That is the sort of thing that has to become second nature.
For my Digital Curation Conference talk, I took some soundings in various disciplines. The one I liked best came from an Earth scientist at Santa Barbara. His name was Jim Frew and he called these Frew's laws of metadata:
- Law no. 1: Scientists don't write metadata
- Law no. 2: Any scientist can be forced to write bad metadata.
His argument was that you needed to work with the computer scientists and the librarians to automate as much as possible the collection of metadata about your data, but also to work with experts like subject librarians to make sure that you had metadata that was going to be useful in five or ten years' time.
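As an illustration of what automating metadata collection could look like, here is a minimal Python sketch. It is not a tool mentioned in the interview: the function names, the `.meta.json` sidecar convention and the chosen fields are all assumptions made for the example. It records what a machine can derive from a CSV file on its own and leaves the column descriptions blank for a subject expert to fill in.

```python
import csv
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def collect_metadata(csv_path: Path) -> dict:
    """Gather metadata derivable from the file itself, with no typing by the scientist."""
    raw = csv_path.read_bytes()
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])          # first row is assumed to hold column names
        row_count = sum(1 for _ in reader)  # remaining rows are data rows
    modified = datetime.fromtimestamp(csv_path.stat().st_mtime, tz=timezone.utc)
    return {
        "file_name": csv_path.name,
        "size_bytes": len(raw),
        "sha256": hashlib.sha256(raw).hexdigest(),
        "modified_utc": modified.isoformat(),
        "columns": header,
        "row_count": row_count,
        # What a machine cannot infer: left empty for a subject expert to complete.
        "column_descriptions": {name: "" for name in header},
    }


def write_sidecar(csv_path: Path) -> Path:
    """Store the metadata next to the data as <name>.meta.json (a hypothetical convention)."""
    sidecar = csv_path.with_name(csv_path.stem + ".meta.json")
    sidecar.write_text(json.dumps(collect_metadata(csv_path), indent=2), encoding="utf-8")
    return sidecar
```

Calling `write_sidecar(Path("samples.csv"))` on a file of that (hypothetical) name would produce `samples.meta.json` beside it. The split between derived fields and empty descriptions mirrors the two halves of the argument: automate what you can, and bring in an expert for the rest.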
The RDA movement is also trying to be very active. There was a conference in Paris a few weeks ago with also some 600 participants. Is this interesting from the perspective of the USA?
Tony Hey: I am on the Council for the RDA, so I went to Paris where there was an impressive number of people. My worry is that there are bandwagons and I would like to make sure that RDA delivers things that are useful for communities. It is great to have the IT managers and the computer scientists all arguing about what the best things to do are, but I think you also have to engage with the scientists and make sure that you're not making things impossible for them and imposing burdens on them that they will regard as a nuisance and not relevant.
Should this also be a role for the eScience centres and the eScience community?
Tony Hey: We started with eScience in the UK in 2001. I really think it is a great thing that The Netherlands have the eScience activity to focus on and bring people's attention to these issues, doing just the right thing of engaging with the communities. The Digital Curation Centre I set up consisted of librarians and computer scientists and their job was to work with different scientific communities. It took them about five or ten years to get to doing that because they weren't used to doing it. The Netherlands are well positioned for making real progress in many scientific areas because of these initiatives like DANS, eScience and so on.
Of course, data is a very important part of moving science forward. The other aspect is computing. I remember the days of HPCnet which was about linking HPC to SMEs. The Commission has just announced that there are twelve new centres bringing HPC to SMEs. Does this mean that nothing has changed or that everything has changed but that this is still important?
Tony Hey: I have spent a lot of time on HPC, as you are well aware. I do think it's extremely difficult to bring HPC to SMEs. We are now getting near the end of Moore's Law. You no longer automatically double your performance just by waiting 18 months, so you will actually have to make some effort in the software. The difficulty is that, for most people, programming a modern supercomputer to get the best out of it is extremely difficult because you now have mixed programming models, with shared-memory OpenMP within a node and MPI between these shared-memory nodes, and you have hierarchies of data caches and things like this. It becomes extremely difficult to optimize your calculation. Normal users shouldn't need to do that, certainly not SMEs.
I am still interested in this but I think the answer for me is not heroic compilers or magic programming languages. There is an approach called templates where you teach an SME to recognize the sort of parallelism in their application, and they then write their code using these templates. The template libraries are written by experts and are optimized for this architecture or that architecture. The user doesn't have to know that. It is not going to be the most efficient implementation you could possibly do of their application but if it speeds it up by a factor of 20, that's fine. It doesn't have to be a factor of 99. If you can get 40 percent speed-up, that is still pretty good.
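The template idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a library named in the interview: `map_reduce_template` and `simulate_case` are hypothetical names, and a thread pool stands in for the expert-tuned backend purely to keep the example portable (for CPU-bound Python work a process pool or a native library would be the realistic choice).

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_reduce_template(
    items: Iterable[T],
    map_fn: Callable[[T], R],
    reduce_fn: Callable[[R, R], R],
    initial: R,
    workers: int = 4,
) -> R:
    """Hypothetical template: apply map_fn to each item independently, then combine.

    The caller only states the pattern (independent work followed by a reduction).
    How the work is scheduled is hidden in here, so an expert could replace the
    thread pool with a process pool, MPI or a GPU kernel without touching user code.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so the reduction is deterministic.
        mapped = list(pool.map(map_fn, items))
    return reduce(reduce_fn, mapped, initial)


def simulate_case(load: int) -> int:
    """Stand-in for one independent unit of work, for example one simulation run."""
    return sum(i * i for i in range(load))


# User code: no threads, locks or message passing appear here.
total = map_reduce_template(range(1, 50), simulate_case, lambda a, b: a + b, 0)
```

The point of the design is the division of labour: the user recognizes that the cases are independent and picks the matching template, while the per-architecture tuning lives behind the function signature.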
So it is still useful to help SMEs but with templates?
Tony Hey: We did MPI way back in the eighties and nothing has replaced it. Message Passing Interface is very low-level programming and really you need to try and raise the abstraction. The only way to do that is by using templates. That's my current favourite approach but we will see. The architectures certainly have become more complicated than they were in the nineties when we did the earlier attempt with SMEs. This certainly needs redoing because these libraries need to be rewritten and so on.
One of the topics that's of interest is open access. What is your vision about that?
Tony Hey: The year 2013 was a tipping point for open access. It was triggered by something that happened in the US. The Office of Science and Technology Policy of the White House came out with a memorandum in February 2013 which required all the federal agencies that fund research to make the results of their research more accessible to the general public. That meant that not only the research papers but also the data on which the research results were based had to be made available. That was a dramatic change because until then we only had open access to publications at the National Institutes of Health with its National Library of Medicine, but that was an outlier. Now, the NSF and the Department of Energy, for example, and the Department of Defense, all these agencies are obliged to have a mechanism to make their research publications available to the general public without having to pay a large sum to a publisher. There is a delay of 12 months or so but open access is really here.
That was followed by a meeting of the Global Research Council - that is NSF but also RCUK in the UK and similar organisations in The Netherlands and Australia and so on - and Germany hosted that particular meeting. They also approved of this principle. Even the European Parliament has approved open access. I think this is going to happen, and the movement in which Europe has been a pioneer, with open university repositories keeping the research papers, is now taking off in the USA. They were very slow but now there are mandates at MIT, Harvard University, Berkeley, the University of California and so on. There has really been a change.
That will really progress science much faster.
Tony Hey: Yes, my hope is that by going from the research publication, you can also go to the data it is based on. You have to have a persistent identifier for the data. You can do some more research and publish it, and this can speed up the whole process of research, bring more productivity into research, which we need, and make the whole enterprise less duplicative. People repeat the same experiments, not knowing other people have done them. You have this global digital library which contains not only publications and data but also the software which you used to generate the results.
To do all this kind of new things in eScience and data science, you need people of course. Is it still interesting for young people to step into science?
Tony Hey: I think it is but I see a problem in that the people who get the rewards in science are the full-time researchers, but there is also a category of people who are very good at writing scientific software and they put a lot of effort into making software that is widely used around the world. They don't typically get tenured academic jobs and professorships and credit in the same way the researchers do. The same is going to be true for people who deal with data. You need to make sure there is a way of giving credit, and you need to make sure that there is career progression for them. They have a recognizable talent and it needs to be recognized in the university system, or wherever, so that they can get a tenured job.
At the eScience Institute at the University of Washington, where I'm based at the moment, they are part of a triumvirate - Berkeley, NYU and the University of Washington - which is funded by the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation. It is intended to look at these alternative career paths. Is there a way to make sure that the people who are the unsung heroes, if you like, actually have a chance of a reasonable career at the end of the day, rather than surviving on soft money grants, living from hand to mouth?
It will probably take some years to change that.
Tony Hey: This is one of the challenges. The Moore-Sloan activities are attempting to see how you change that in three universities and we will see how that gets on.