This news blog provides news about the e-IRG and related e-Infrastructure topics.


Funding agencies and academia need to rethink reward structure for computational tool developers to tackle big scientific challenges

During the 4th National eScience Symposium, held in the Amsterdam ArenA last October, we were able to talk with Fernando Pérez from the University of California. Fernando Pérez is a physicist by training, but since his PhD he has spent most of his career building computational tools, working in the fields of applied mathematics and neuroscience. In the last few years he has been part of a team building an institute for data science at the University of California at Berkeley, where he holds an appointment. He recently moved to Lawrence Berkeley National Laboratory, where he now conducts his research as part of the Data Science and Technology Department.

For the last 15 years a lot of his work has centred on building open source tools for scientific computing and data science, mostly, but not exclusively, in the Python programming language. He started building IPython as a graduate student about 15 years ago. Over time he became deeply involved with the entire scientific Python ecosystem, a collection of interoperable tools, each targeting a different problem, whether numerical processing, algorithms, statistical computing or data visualization. This ecosystem has been built by a large collaboration of people over roughly 15 years. IPython itself is a large project built collaboratively, not just by Fernando Pérez but by a big team of contributors. It has evolved into Project Jupyter, which extends the original ideas of IPython to multiple programming languages. This is where he now spends most of his research time, building the platform with the rest of the team and continuing to push it as an environment for data science.

Fifteen years ago, Python was relatively new. It was, in fact, invented in Amsterdam. Fernando Pérez had the opportunity to visit the Center for Mathematics and Informatics (CWI) in the Science Park at the University of Amsterdam, where Python's inventor Guido van Rossum was working when he created the language. Fernando Pérez had lunch in the cafeteria of that building and enjoyed seeing the birthplace of Python. The language was not originally designed as a scientific programming language: it was built as a general-purpose language, but it turned out to be remarkably well adapted to the needs of scientists. It is high-level, easy to read and write, and can be used interactively, which is very useful for exploring data and iterating on algorithmic and computational questions rather than only building software. It is well suited to the kinds of workflow that scientists need, and it interfaces easily with high-performance low-level libraries. The mix of being high-level and flexible, easy enough for scientists to use, while still allowing you to get performance, made it an ideal combination. Many scientists from multiple disciplines adopted it.
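That combination of readable, interactive code on top of compiled libraries can be sketched in a few lines. This is a minimal illustration, assuming NumPy is installed; the particular computation is invented for the example, not taken from the interview.

```python
import numpy as np

# High-level, readable expression of a small numerical experiment.
# The heavy lifting (the element-wise arithmetic and the reduction)
# is delegated to NumPy's compiled C/Fortran kernels.
signal = np.sin(np.linspace(0, 2 * np.pi, 1000))

# One line of Python; the loop over a thousand samples runs in C.
energy = float(np.sum(signal ** 2))  # ~499.5 for this sampling

# This style supports the iterative workflow scientists need:
# inspect the result, tweak a parameter, re-run.
```

The scientist writes what looks like mathematics, while the performance-critical loops execute in compiled code underneath.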

Fifteen years ago, it might have been only a few people trying this out but over time it has become a major force in the scientific computing world. Fernando Pérez explained that he saw a presentation from one of the technical leads at the Large Synoptic Survey Telescope (LSST) project, which is one of astronomy's biggest projects. The presenter was explaining how their entire software pipeline is based on Python. This is a big victory for this originally Dutch creation. It has had a huge impact in scientific research and in education as well, because these same tools are also used for teaching.

We remarked that 15 years ago, working with data, or making a profession of it, was also relatively new. Most scientists at that time collected their own data, did their own Fortran programming and did not have people building tools for them.

Fernando Pérez said that a few fields had a stronger tradition of building computational tools, typically physics and neuro-medical computing, as well as the big solvers for partial differential equations. Those were a few selected fields, and even there the job of building these tools was not much rewarded. In fact, many of those who really got absorbed into this kind of work were told multiple times not to do it because it would be scientific career suicide, Fernando Pérez smiled. Indeed, many people were punished professionally for doing just that. Gradually there is greater recognition in the scientific community at large that this is actually a necessary part of research, and that we have to recognize, value, reward, support and fund this kind of work, because we are never going to be able to tackle the next generation of scientific questions if we don't have the tool infrastructure to do it. That tool infrastructure is not going to get built in people's spare time, Fernando Pérez insisted.

We noted that the conference was organized by the eScience Center, which brings multi-disciplinary scientists together with people who know how to program and how to analyze data, and with tool specialists. This is happening more and more in Europe, where 25 or more of these eScience Centers already exist across different countries. How is the situation in the US?

Fernando Pérez said that in the US there is also a movement towards building similar kinds of institutions. He is part of the team at UC Berkeley that started the Berkeley Institute for Data Science. That was a collaborative project between UC Berkeley, the University of Washington and New York University (NYU), so there are three institutes of this type, each with its own peculiarities because the institutions are different. They were collectively funded by two foundations, the Gordon and Betty Moore Foundation and the Alfred P. Sloan Foundation, which gave the three universities a common grant to build institutes of this kind within US universities. The aim is precisely to explore these same questions: the combination of tool development, domain research questions, methodological development, integration within the university environment and career paths, a complex combination that does not fit easily into the existing structure of universities and departments, or into the career paths that at least US academia supports.

Fernando Pérez explained that they were tasked with building entities that could push the limits of those questions. While in Amsterdam, he had good discussions with the team of the Netherlands eScience Center. He learned a lot and took many notes about the approaches they are taking and how they are building their collaborations, to see what lessons he can bring back to the US.

We asked whether there were good collaborations going on between Europe, the US and Japan on the eScience side.

Fernando Pérez thought that there are collaborations, but that there is still a lot to learn. There are notable differences in how the funding structures are constructed in Europe and the US, and this has an impact. There is still a lot of opportunity for better integration in this regard, because these spaces are very new, in Europe as well. We are breaking new ground and cutting across the grain of the institutions, doing things that are somewhat orthogonal to the way the system has traditionally operated. There is still plenty of opportunity for learning lessons across the systems.

We observed that funding agencies also have to adopt new ways of funding and find ways to deal with this.

Fernando Pérez thought it worth mentioning that open source tool development has traditionally been a highly collaborative space. In the part of that world he knows best, the scientific Python world, all of the major projects have large contributions from both European and US teams, and obviously from other parts of the world as well. There is a lot of activity on both sides, and those teams work very much hand in hand in a highly collaborative manner.

We asked what new developments or important things currently are happening in this field.

Fernando Pérez said that there is a lot. The scientific Python community continues to mature, grow and tackle more complex problems. One development of roughly the last five years makes him very happy: the work being done is no longer concentrated only on what he would call the core tools, traditionally the numerical, algorithmic and visualization tools. Those continue to be developed, but what has been very good to see in the last few years is a big explosion of really high-quality libraries in domain areas. Good libraries are being developed to analyze data for very specific problems in geophysics, astronomy and radio astronomy, solar physics, and the social sciences. That is a sign of the health of an ecosystem: the basic layer is in good shape, and people no longer need to build that foundation because it is already taken care of. Communities can now develop tools for the specific areas and problems that matter to them, all still reusing the basic foundation. That is the sign of a healthy and growing ecosystem.

The other thing Fernando Pérez thought was worth signaling, which is not in Python, is the growing development of the Julia programming language. This is a really interesting modern high-level language, developed with a lot of new ideas and new research from the computer science perspective, while also bringing in the lessons of systems like Python with a fresh perspective. Fernando Pérez is really excited to see what is coming out of that community, which is growing in a very healthy way and has just released version 0.5 of the language. Julia was originally focused on numerical computing, but it is a general-purpose language that tries to combine the best lessons of languages like Python with very modern infrastructure from the compiler and type-theory perspectives, so that you can express high-level ideas and get very high performance out of the box.

In Python you do sometimes hit performance bottlenecks where you have to resort to C, C++ or Fortran. The transition between the two levels is possible, but it is not completely smooth. Julia tries to give the scientist all the performance you need while still operating at a very high level. To do that, its developers rethought the type system and made a compiler available at runtime: even though Julia is a dynamic language, it uses LLVM (Low Level Virtual Machine) at runtime to constantly generate machine code, driven by a very sophisticated type-inference machinery. It tries to give the best of both worlds: the expressive power and high-level exploratory feel of tools like Python with the performance profile of a language like C. Julia was born out of a research project led by Alan Edelman in the Mathematics Department at the Massachusetts Institute of Technology (MIT). There are really a lot of interesting ideas coming out of it, according to Fernando Pérez.
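The bottleneck pattern described above, and the usual first escape hatch before hand-written C or Fortran, can be sketched on the Python side. This is a minimal illustration with invented function names, assuming NumPy is available; Julia's approach is to make this two-level split unnecessary.

```python
import numpy as np

def mean_square_python(xs):
    # Pure-Python loop: flexible and readable, but every iteration
    # pays interpreter overhead -- the classic performance bottleneck.
    total = 0.0
    for x in xs:
        total += x * x
    return total / len(xs)

def mean_square_numpy(xs):
    # The same computation delegated to NumPy's compiled kernels:
    # the loop now runs in C, at the cost of leaving plain Python.
    a = np.asarray(xs, dtype=float)
    return float(np.mean(a * a))

data = [0.5] * 100_000
# Both routes agree on the result; only the execution model differs.
assert abs(mean_square_python(data) - mean_square_numpy(data)) < 1e-9
```

For large inputs the vectorized version is typically orders of magnitude faster, which is exactly the gap Julia's runtime compilation aims to close without forcing the switch.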

We asked whether this is one of the languages that could be used for exascale computing.

Fernando Pérez thought it is a little bit early to tell. The problems in exascale computing are really complex. The developers have put a lot of thought into how Julia will scale and how parallelism will work in the language. They are at least thinking in that direction, but it is still a little bit early to tell.

Fernando Pérez ended by saying that he has learned a lot from the work at the Netherlands eScience Center. There are already opportunities for collaboration and contacts.