JOSSCast #8: Deep Learning for 3D Protein Structure Predictions – Giulia Crocioni on DeepRank2

Subscribe Now: Apple, Spotify, YouTube, RSS

Giulia Crocioni joins Arfon and Abby to discuss DeepRank2, a deep learning framework for 3D protein structure predictions.

Giulia is a Research Software Engineer at the Netherlands eScience Center, where she uses machine learning techniques to develop and contribute to methodologies and applications that answer life sciences research questions.

You can follow Giulia on LinkedIn

Episode Highlights:

  • [00:02:16] Giulia introduces herself and her work at the Netherlands eScience Center in the Life Sciences section.
  • [00:04:09] Introducing DeepRank2: An open-source deep learning framework for data mining of protein-protein interfaces or single-residue variants.
  • [00:10:14] Target audiences (researchers, developers) and applications (drug design, cancer vaccine development)
  • [00:13:48] Ecosystem overview from complementary projects like AlphaFold, to training data sources, to using PyTorch.
  • [00:23:27] Open Source & Contributions

Transcript:

[00:00:05] Abby Cabunoc Mayes: Welcome to Open Source for Researchers, a podcast showcasing open source software built by and for researchers. My name’s Abby,

[00:00:11] Arfon Smith: My name’s Arfon.

[00:00:12] Abby Cabunoc Mayes: and we’re your hosts. So every other week, we interview an author published in the Journal of Open Source Software, or JOSS.

This is episode 8, and today we’re chatting with Giulia Crocioni about their paper, DeepRank2: Mining 3D Protein Structures with Geometric Deep Learning. I really enjoyed this conversation because structural bioinformatics was actually my favorite course in university, but I never did any of it professionally, and I was like, oh yeah, proteins, I remember how this works.

[00:00:39] Arfon Smith: I also enjoyed it. I always forget what proteins are. I know they’re very important, and so it’s always good to reload it into your brain a bit. I always get kind of confused with structure versus active structure, because there’s the difference between, yes, there’s the sequence of molecules and atoms, but then actually how their shape turns out to be way, way important, right, for drug design and biological function and stuff, so it’s cool.

This is like one of those software packages where clearly there’s just a ton of real world value in this software. Speaking as an astronomer, that can’t always be said.

[00:01:16] Abby Cabunoc Mayes: It was also nice to hear more about the eScience Center. We have a repeat guest from the institution, which is doing amazing work.

[00:01:23] Arfon Smith: Look, I’m just going to say, if we are still doing this podcast in a year, we should do an on site recording at the Netherlands eScience Center.

[00:01:31] Abby Cabunoc Mayes: We’ll just walk down the halls.

[00:01:32] Arfon Smith: Just going door to door with a mic.

Yeah, they do great work.

[00:01:37] Arfon Smith: There’s lots happening there. A thing I learned today was that they do a competitive process for applying for research software engineering time. Which, on reflection, makes a ton of sense. These are very specialized skilled individuals who can help you with your research problems.

And so why wouldn’t you make that a competitive process to ask for their time? They’re clearly doing lots of interesting work there.

[00:01:58] Abby Cabunoc Mayes: Yeah, well, let’s play the interview.

[00:02:00] Arfon Smith: Yeah, let’s get going.

[00:02:01] Abby Cabunoc Mayes: Welcome to the podcast, Giulia. We’re so glad to have you here.

[00:02:04] Arfon Smith: Yeah,

[00:02:04] Giulia Crocioni: Thanks, I’m so glad to be here.

[00:02:05] Abby Cabunoc Mayes: I know we’ve had one other person from Netherlands eScience Center, so we’re definitely a big fan of where you’re at. But I guess that’s a bit of a spoiler for your intro, but do you want to say a few words about yourself and where you’re calling from?

[00:02:16] Giulia Crocioni: Yeah, sure. So, I’m Giulia Crocioni. I am originally Italian. Now I live in Amsterdam, in the Netherlands. And I work at the Netherlands eScience Center in the Life Sciences section. My background is in biomedical engineering and data science. And now in the Life Sciences section, I help researchers develop software, especially in genomics and the life sciences domain more generally.

[00:02:40] Arfon Smith: I have a question straight off the bat, which is how many research software engineers are there at the eScience Center? There seems to be

[00:02:47] Giulia Crocioni: Around 80,

[00:02:48] Arfon Smith: Oh, wow. Okay. That’s

[00:02:50] Giulia Crocioni: Yeah, we grew quite a lot in the last couple of years. And myself, I started in 2022. So, I was in one of those big rounds.

[00:03:01] Arfon Smith: So does it feel like a real proper team? Do you have an identity as a research software engineer or are you very local to the particular research problems you’re working on?

[00:03:10] Giulia Crocioni: Yeah, so we do have teams, so we have different sections. I am in the Life Sciences section. So everything that is bio related goes there, project wise. And then within the section, we have also different teams. I’m in a team called Team Flow. We go with the flow and we do big data and health related projects and also machine learning related projects in the domain of the Life Sciences. We work really close to each other. We do not participate in all the projects of the team, but usually we work in two, three people at the same time for each project.

[00:03:47] Arfon Smith: Nice. Nice. It sounds like a great place to work.

[00:03:49] Giulia Crocioni: It is, it definitely is, yes.

[00:03:52] Arfon Smith: Cool. So we’re here to talk about DeepRank2 and so I guess maybe first question, was there a DeepRank 1? Is this the second version of this

[00:03:59] Giulia Crocioni: Yeah, there were many.

[00:04:02] Arfon Smith: Oh okay, well maybe there’s a deep rank zero as well.

So tell us about the software, what it does why you started the project, what kind of problems does it solve?

[00:04:09] Giulia Crocioni: So first, I think it’s good also to give a bit of context about how we work on these projects. So basically, our organization is a national center that awards research projects based on calls for proposals. And what the awarded projects win is hours of work from the research software engineers of the organization.

And DeepRank2 is part of a very big project awarded in the call of 2021. This call was awarded to Li Xue, who is an assistant professor in the Department of Medical Biosciences at Radboud University Medical Center in Nijmegen, also in the Netherlands. And this package is actually an improved, unified version of three other packages.

Two of the three were also developed within one of our internal calls. The main point about this package was that we noticed, starting from the software we had developed ourselves, that the software solutions around for these kinds of problems were usually highly specialized and especially lacked flexibility.

That was also the case in the previous call, in which two packages, one called DeepRank and the other DeepRank-GNN, were developed. They were sort of similar to each other, tackling similar research questions, but from slightly different points of view. So, being more specific, the very first DeepRank package was developed to implement convolutional neural networks and train them on molecular structural data represented as grids.

Then there was a second version of it, called DeepRank-GNN, that was basically the same thing, but using graphs as the data representation, and then graph neural networks to train on this kind of data.

Then there was also a third version developed, not by us, but by one of the postdocs of the lab at the RadboudUMC. And she was focusing on pathogenic variants in protein protein complexes using convolutional neural networks again. But this was a slightly different representation of the same data.

And so we had the three versions that sure were doing different things, but not so far from each other. And they were having slightly different APIs, so you couldn’t just plug and play the same code to the different packages. Some documentation was lacking, some tutorials were lacking as well.

And so we unified everything in this enhanced version. And now this one is able to do all of these things.

[00:06:42] Abby Cabunoc Mayes: Oh, that’s great

[00:06:43] Arfon Smith: it sounds like a perfect sort of aggregation of lots of different

[00:06:46] Giulia Crocioni: Yes,

[00:06:47] Arfon Smith: Independent software projects packaged into a generalized library, which sounds super useful.

You said something right at the start that was super interesting I just have to ask about. You said people apply for software engineer time? Did I hear that right? So researchers apply for research software engineering capacity to be applied to their domain. Is that right?

[00:07:06] Giulia Crocioni: Correct. Yes, and these people are Dutch researchers at different stages of their careers. But yeah, there are some requirements from this point of view.

[00:07:15] Arfon Smith: I just really like that model. I like it for a bunch of reasons. It really reinforces the value of the time of those individuals. But it’s also like a computing or research computing grant in some ways, right? Like people are used to applying for supercomputing time, but what about applying for research software engineering?

Anyway, that was a side note.

[00:07:34] Abby Cabunoc Mayes: I remember talking to Nico about how he often collaborates with these different organizations around the Netherlands. I assumed that you just found these collaborators, but having them apply for time, that makes a lot of sense.

Anyway, for the original DeepRank and DeepRank-GNN, I know those are different implementations for doing the analysis. Is there a difference for the end user who’s actually using the software? Does one do something better than the other?

[00:07:55] Giulia Crocioni: It really depends on the problem. Yeah, in absolute terms it’s difficult to say if one architecture is going to work better than another one, or if one type of data representation is going to work better than another one.

Of course, there are some general things that we can say, like that graphs are definitely a more compact and efficient data structure than grids, for example. And they represent the same amount of information in much less space and also in a much more compact structure itself, which is a graph structure.

But still, then it really depends on the problem. So you really need to just try and see with your data set, with your neural network architecture, and with your type of target as well, what works best.

[00:08:39] Arfon Smith: So you can use a graph to represent a structure, I can think how that would be, or you could use some kind of Cartesian grid or something. Those are both just valid ways of representing structures. But the sort of underlying technology, or the network, so a neural network optimized for graph structures, it’s just that then you get to choose different types of algorithms for traversing those structures. Is that sort of the right way to think about it?

[00:09:05] Giulia Crocioni: Yes, and also to give you a bit more context. So how it works is that first we create the graphs. So when you plug in some data, the package internally first creates the graph representation of the data, and then, if the user asks for it, it maps the graphs to grids, because these graphs also contain information about the spatial position of residues, or amino acids, and so everything can be mapped straight away to a three-dimensional grid.

And then according to this, we have different classes of datasets that a user can define. So the user can define either a graph dataset based on the graph representation, or a grid dataset based on the grid representation. And according to which one the user chooses, it’s also possible to use either convolutional neural networks, which are the ones to be used with three-dimensional grids, or graph neural networks, which are the ones to be used with the graph representation.

So, basically, the networks are different, so the layers are gonna be different. And we use PyTorch for all these options.
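For readers who want a rough picture of that flow in code, here is a minimal sketch. The import paths, class names, and arguments below are assumptions reconstructed from the conversation rather than a verified copy of the DeepRank2 API, so treat it as pseudocode and check the project’s documentation and tutorials for the real interface.

```python
# Minimal sketch of the DeepRank2 flow described above. Import paths, class
# names, and arguments are assumptions based on the conversation; consult the
# DeepRank2 documentation and tutorials for the actual interface.
from deeprank2.dataset import GraphDataset, GridDataset          # assumed imports
from deeprank2.trainer import Trainer                            # assumed import
from deeprank2.neuralnets.gnn.naive_gnn import NaiveNetwork      # assumed GNN architecture

# The processed HDF5 file holds the graph representation (and, on request,
# the mapped 3D grids) plus the chosen target values.
train_set = GraphDataset(hdf5_path="processed_models.hdf5", target="binary")

# Swapping GraphDataset for GridDataset (and the GNN for a 3D CNN) is the
# "choose your representation" step Giulia describes.
trainer = Trainer(neuralnet=NaiveNetwork, dataset_train=train_set)
trainer.train(nepoch=20)
```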

[00:10:12] Arfon Smith: Okay. Very good. Yeah.

[00:10:14] Abby Cabunoc Mayes: Yeah, that was helpful, just to understand the span of uses that this can have. So with that in mind, who is your target audience for the software?

[00:10:22] Giulia Crocioni: Yeah, so the target audience is very broad. So, in general, any researcher that needs to run experiments based on neural networks and on this kind of structural molecular data, and of course doesn’t want to spend too much time on data processing, which is usually a very tedious phase, because just figuring out how to map these data, which features to generate, and how to use them is already a big chunk of the pipeline.

And then maybe the users are interested in using their own machine learning pipeline, so maybe they have their own TensorFlow neural networks, or maybe they don’t want to implement that either, and so they can just use our pre-implemented PyTorch pipeline. But anyway, they can be researchers that are also programmers, but just do not want to spend time on this part because they are maybe interested in trying out some fancier kinds of neural networks, for example, so they are not so interested in the data processing part.

There are also people that are not experienced at all with programming. So that’s also why we provide very extensive documentation and tutorials, because this software is also intended for people that do not have much experience with programming at that low level.

[00:11:40] Abby Cabunoc Mayes: Okay. Yeah, that makes sense. And I know in the paper you mentioned things like drug design and immunotherapy. Is that also, on the research side, who your target audience is?

[00:11:47] Giulia Crocioni: Yes, yes, because maybe researchers identified a class of proteins that are really of interest, but then they want to select a subset of these proteins. And so to have a predictor that runs predictions with high accuracy on such complexes can be something very, very useful for drug design, for example. Or an application that we actually developed using this package is about cancer vaccines.

So in developing certain types of cancer vaccines, it’s really important to understand if certain types of proteins that are called MHC proteins, I’m not digging into that too much now, but certain kinds of surface proteins, if they are able to bind with peptides that are within the cell. Then they are able to expose these peptides on the surface of the cell, and if the cell is a tumor cell, then the immune system is able to activate and kill the cell.

So it’s very important to select these peptides well. And how do you do that? Well, you can train a neural network on thousands or even millions of data points that tell you which peptides are good candidates, by predicting which ones have a good binding affinity with these proteins. So that’s what we did as a use case for the software.

[00:13:04] Abby Cabunoc Mayes: Nice. Yeah, and I haven’t heard much about cancer vaccines, so that’s an interesting application for this. Yeah, thanks.

[00:13:09] Giulia Crocioni: definitely.

[00:13:09] Arfon Smith: A bit of a sort of sideways question, which is just to help calibrate where we are in the technology space here. So I’ve read about things like AlphaFold and protein folding challenges. And actually, even before that, the Foldit challenge, where citizen scientists were involved in trying, through a game, to help fold large molecules, proteins, in a browser, and it sort of gamified that.

I was just curious if you could calibrate me a bit on how this work relates to some of the other work that happens around protein structures and protein folding. Which part of the problem are we working on here with this?

[00:13:48] Giulia Crocioni: Sure. So what AlphaFold does, and in particular the latest version of it, which is AlphaFold 2, is take as input sequences of amino acids, so of these basic building blocks of proteins, and then output the coordinates in space that these amino acids are going to have when the protein is folded, so when it’s in its active state.

Because if you have only the sequence of a protein, you actually don’t know one of the most important pieces of information about it, which is its structure in space. Because the way in which proteins fold determines a lot of their properties, what they can do and what they can achieve, and what they cannot. So it’s really, really important to know.

So AlphaFold takes as input the sequence and outputs a file containing the sequence, enriched with information about the Cartesian coordinates of each residue, or amino acid, however we want to call it. And then what we do instead is that we take this information, so this file already enriched with information about the space.

So we have two types of information as input. We have the sequence of amino acids and atoms, plus the position in space of these building blocks. And then starting from there, with domain knowledge and with the computational part of the data processing of the package, we enhance the information, adding all kinds of physicochemical features: the charge and polarity of the residue, the electrostatic field, the van der Waals potential, all sorts of things.

And this is what is used in our pipeline to train our networks. So we come sort of after the AlphaFold 2 phase, let’s say.
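To make the "sequence plus coordinates" input concrete, here is a small, generic Python sketch (not DeepRank2 code) that pulls residue names and Cartesian coordinates out of the ATOM records of a PDB-format file such as an AlphaFold 2 prediction; the filename is hypothetical.

```python
# Generic sketch: read residue names and Cartesian coordinates from the ATOM
# records of a PDB file. "model.pdb" is a hypothetical filename, e.g. an
# AlphaFold 2 prediction.
atoms = []
with open("model.pdb") as fh:
    for line in fh:
        if line.startswith("ATOM"):
            residue = line[17:20].strip()   # three-letter amino acid code
            x = float(line[30:38])          # Cartesian coordinates in angstroms
            y = float(line[38:46])
            z = float(line[46:54])
            atoms.append((residue, x, y, z))

print(f"{len(atoms)} atoms read; first record: {atoms[0]}")
```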

[00:15:37] Arfon Smith: Okay, that’s useful to know. And actually, related to that, what would be your source material that you would train on? Is it lots of experimental data? How do these networks get trained? I’m aware of techniques like X-ray crystallography, and you can get structures from there. What are the sort of source materials that have made your trained networks actually useful to academics?

[00:15:59] Giulia Crocioni: Yeah, so it can be all sorts of materials. It can be material derived from crystallographic experiments. Nowadays they are publishing more and more datasets containing these kinds of experimental results. Those are very useful. And then there is also more synthetic data, so data generated by software like AlphaFold 2, but not only that.

There are also many other software tools that do similar things, maybe tackling specific structures. AlphaFold aims to be more general and to work with everything, but then there are many, many similar tools that are just narrower in this sense. So it can be both experimental data or synthetic data that is based, in some way, on some physical modeling.

And yeah, which one is better? Again, it really depends on the specific data we are talking about. A con of experimental data is that it can be labelled wrong. So you need to do a data cleaning phase very often. You need to be aware that there may be mistakes in how experimentalists classified things, or in the techniques that were used; maybe it’s old data.

But then, on the other hand, for the ones that are right, you’re super sure that they are correct and fully represent the chemistry and physics behind the structure. Synthetic data, instead, aims at reaching a very high accuracy from this point of view. But again, it depends on the tool, on which kind of model it relies on inside.

[00:17:31] Abby Cabunoc Mayes: Yeah, and I know data wrangling is always a huge deal. Especially genomic data, just trying to clean it all up, so I understand that pain of training data.

And I know that you’ve mentioned PyTorch a couple times. Can you tell me why you decided to use PyTorch? And how that’s been going?

[00:17:47] Giulia Crocioni: So, PyTorch is one of the biggest Python platforms for developing deep learning based algorithms. We went for that one because it’s very well documented, it has many tutorials, and many people are using it, so many people know about it. So it’s both easier for developers to dig deeper into it, but also for future users or future developers to keep going and build on top of that.

So this is also something that, as the Netherlands eScience Center, we really like to do in general: to not reinvent the wheel. We always try to evaluate which open source software is available, because it’s always better to build on that and, again, to not reinvent the wheel or start from scratch, because other very smart people very likely have already thought about these things and likely they have already developed them well.

[00:18:41] Arfon Smith: Yeah, for sure. It always makes sense to use something off the shelf if you can, I think, especially for core dependencies. And PyTorch seems to be wildly popular these days and I think rightly so. I was going to ask a slightly different question related to the sort of infrastructure aspects of the project.

What’s the cost of training neural networks here that you’re working with? Is this an expensive operation? Is there particular hardware that people have to use? And so sort of training and then also, you know, inference and running the network. Could you say a little bit about what sort of hardware resources people would typically need to make use of this software?

[00:19:18] Giulia Crocioni: Sure. So the training is definitely more expensive than the inference time. So the inference time, if you have a machine that allows you to just run a model you are done. And we are not talking about large language models that could occupy terabytes of space. So we’re talking about much smaller models here.

So the inference part is definitely less of a bottleneck than the training part. The training part, on the other hand, really depends on how much data you’re training on. Usually you need at least on the order of thousands. The order of millions is a good order of magnitude for these kinds of research questions, usually.

And for that, it’s really advisable to have very powerful hardware. So for example, here we are using the Dutch supercomputer. It’s called Snellius, and for the experiments that we ran with DeepRank2 to showcase it we used only one GPU, but the package in general supports parallelization at both the CPU and GPU level.

So, yeah, the package, like PyTorch indeed, can definitely handle parallelization, so optimization of the training procedure, but also of the data processing. But it’s more on the users, and they would need to have something a bit more powerful than maybe their local machine.
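To make the hardware point concrete, here is a generic PyTorch pattern, not DeepRank2-specific code, for falling back to the CPU when no GPU is available and for parallelizing data loading; the model and data are toy placeholders.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Use a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(128, 1).to(device)                 # toy placeholder model
data = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))
# num_workers > 0 parallelizes data loading across CPU processes.
loader = DataLoader(data, batch_size=64, shuffle=True, num_workers=4)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for features, target in loader:
    features, target = features.to(device), target.to(device)
    loss = torch.nn.functional.mse_loss(model(features), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```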

[00:20:44] Arfon Smith: Yeah, makes sense.

[00:20:45] Abby Cabunoc Mayes: Yeah, and I guess related to that would users have to train a new neural network for each different type of target, or how often do you think a researcher would have to go through that process and, like, get a hold of a GPU or a cluster?

[00:20:58] Giulia Crocioni: Yeah, so, in general, in machine learning, when you change the target, the weights of your networks very typically need to change. So if you’re using a network that has been trained on another target, and then you try to do inference on a new target, it’s very likely that it’s not going to perform well, because the network has been optimized on that specific target that you used before.

So yes, in general, you do need to retrain networks on new targets. And yeah, depending on the problem, you may think about either classification or regression. So maybe you have continuous targets. So going back to the cancer vaccine example, you can have the binding affinity value, which is a continuous value of these molecules.

And this is a regression problem. Or you can think about, okay, I want to predict if these two molecules bind or not. So this is more of a binary classification problem. And in these two different examples that I mentioned, you’re also going to use a different loss function for optimizing your network.

So you definitely need to, yeah, train again if you change the target.
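As a small illustration of the point about targets and losses, here is a plain PyTorch sketch, not DeepRank2 code, contrasting a regression loss for a continuous target such as binding affinity with a classification loss for a binary binder/non-binder target.

```python
import torch
import torch.nn as nn

predictions = torch.randn(8, 1)            # raw network outputs for a toy batch

# Continuous target (e.g. binding affinity values): regression with MSE.
affinities = torch.randn(8, 1)
regression_loss = nn.MSELoss()(predictions, affinities)

# Binary target (e.g. binder vs. non-binder): classification with BCE on logits.
labels = torch.randint(0, 2, (8, 1)).float()
classification_loss = nn.BCEWithLogitsLoss()(predictions, labels)
```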

[00:22:02] Arfon Smith: Thank you. That’s useful to know, and for people who want to use the package as well. I want to switch tracks a little bit and ask about open source software. JOSS is a journal that’s about open source software, and this is an open source package. Tell us a bit about your experience with open source software, and maybe what the reason was for publishing this as open source software in particular.

[00:22:22] Giulia Crocioni: So we think, as an organization, first of all, that open source software is the basis for collaborative science. And collaborative science in turn is the basis for long-term, working science as well. So, yeah, the JOSS journal in this sense really aligns with our principles and our vision as well.

And we also think that when you have open source software you can encourage transparency, you can encourage reproducibility and trust, and then you can also involve the community more. These are just advantages, because then even when I’m not going to be here anymore to maintain the package, if the package has spread enough across the community, there are going to be other developers willing to maintain it.

And this is something that always helps things keep going and build up.

[00:23:11] Arfon Smith: Yeah. No, that vision definitely strongly resonates, certainly for me, and I’m assuming for Abby too. I’m curious though, is there some kind of manifesto or something at the eScience Center that really embraces this? Are these ideals written down, or is it just sort of

[00:23:27] Giulia Crocioni: Yeah, we do have a vision statement, a strategy statement, on our website. Yeah.

[00:23:34] Arfon Smith: Oh, cool. Okay.

[00:23:35] Giulia Crocioni: When we have to make design choices, for example, when we are at the beginning of a project and we need to pick packages and libraries, we have to make many choices.

We always prefer to go for open source existing software. And again, if something already exists that does something similar to what we need, it’s always good to go there. Maybe they have a repository online, so maybe you can, even yourself, you can contribute to the package. You don’t need to restart from scratch.

[00:24:05] Arfon Smith: Yeah. Nice. Very nice.

[00:24:07] Abby Cabunoc Mayes: And Giulia, had you done much open source before joining the eScience Center?

[00:24:11] Giulia Crocioni: I did a bit during university, but then I worked for an insurance company for a couple of years, and there, of course, nothing is open source, because that type of industry is a bit different.

[00:24:23] Abby Cabunoc Mayes: Yeah. So is that a big draw to joining the eScience Center?

[00:24:27] Giulia Crocioni: Yes, yes. And it’s intertwined with the research. And yeah, I really love the vision statement that we have, and in general this concept of using and developing open source software in order to spread it more across the community and engage the community. It’s really, really nice, and this is how science should be done, in my opinion. Unfortunately it’s not always like that, for many reasons, of course; it’s not an easy topic. But still, I believe that going in this direction is the right thing.

[00:25:01] Abby Cabunoc Mayes: Yeah, and one of the reasons why I picked this paper for the JOSSCast was looking at your repo and just seeing how many contributors there were and how collaboratively you all seem to be working. So I’m excited to see this being open for contributions.

[00:25:14] Arfon Smith: This is obviously a really nice package and it was really fun to see it published in JOSS, so thank you for publishing with us. I guess I wanted to ask, what are the sorts of things that you’re looking for in terms of new contributions? Are there obvious things that people could be doing to help improve or extend DeepRank, maybe prepare for DeepRank 3 if that’s gonna happen someday, I don’t know. What contributions do you especially want as a team?

[00:25:39] Giulia Crocioni: Yeah, so it would be amazing to have both structural biology, more domain oriented people, but also programmers, more machine learning slash deep learning oriented people. I think both can contribute a lot, and we would also like to incentivize mixing these communities, which very often are very much separated from each other.

Yeah, and in general, just open issues, open PRs. We also have a nice discussion tab on our repo and we always welcome new contributions. It can be that you tried the software and maybe you haven’t figured out how to install it, or you ran it and you ran into an issue or an error, or you would like to see a feature implemented.

So it can also just be new requests, and then we see how they fit with our timeline, or just a discussion. So again, whatever topic users think is interesting, we are open to picking it up.

[00:26:39] Arfon Smith: Very nice. Yeah, a lot of the important work, especially with a more mature piece of software when you’ve actually got people using it, is just supporting people who are using it and answering questions, right? There’s a lot of non-code-creation work that’s still essential for any big project.

So cool. Okay.

[00:26:56] Abby Cabunoc Mayes: Yeah, so where can we find you online and keep up to date on your work, both personally and with DeepRank2?

[00:27:01] Giulia Crocioni: Yeah, so for DeepRank2, definitely on GitHub, in our repository. So we have this huge DeepRank organization that also contains many other interesting repos, including the archived older ones, and there you’re going to find the DeepRank2 repo. And then on LinkedIn, through the eScience Center channel, of course; they also post news about DeepRank2.

And yeah, me on LinkedIn or on Instagram, but that is less related to my software engineering work.

[00:27:30] Abby Cabunoc Mayes: Do you have pictures of stroopwafels?

[00:27:32] Giulia Crocioni: Yeah

[00:27:34] Abby Cabunoc Mayes: When Nico was here, I was like, I love Stroopwafels, but I’ve never been to Amsterdam. Anyways,

[00:27:41] Arfon Smith: yeah, Americans seem to have a lot of them. I know you’re Canadian. I know you’re not American, Abby. But in North America, there are many Stroopwafels. A surprising number.

Giulia, thank you so much for coming on the JOSSCast today and telling us about DeepRank2. It’s been really fun to get to talk to you and learn about your work.

So thanks again.

[00:27:58] Giulia Crocioni: Thank you so much. It was amazing to be here. I loved it.

[00:28:01] Abby Cabunoc Mayes: Thanks, Giulia.

Thank you so much for listening to Open Source for Researchers. We showcase open-source software built by and for researchers. You can hear more by subscribing in your favorite podcast app.

The Journal of Open Source Software is a community-run journal relying on volunteer effort. If you’d like to support JOSS, please consider making a small donation towards running costs at numfocus.org/donate-to-joss. That’s N U M F O C U S.org/donate-to-J O S S.

Open Source for Researchers is produced and hosted by Arfon Smith and me, Abby Cabunoc Mayes. Edited by Abby and music CC-BY Boxcat Games.

JOSSCast #7: Adding defect analysis to the Materials Project – Jimmy Shen on pymatgen-analysis-defects

Subscribe Now: Apple, Spotify, YouTube, RSS

Jimmy Shen sat down with Arfon and Abby to discuss the role of defect analysis in semiconductor research, the Materials Project, and the development of pymatgen-analysis-defects.

Jimmy is a postdoc at the Lawrence Livermore National Laboratory where he tries his best to automate himself away.

You can follow Jimmy on GitHub @jmmshn, LinkedIn, or on Google Scholar.

Episode Highlights:

  • [00:02:19] Introducing pymatgen-analysis-defects and the Materials Project
  • [00:07:09] pymatgen packages
  • [00:07:36] Importance of defects in semiconductor research
  • [00:11:19] Target audiences and alternatives
  • [00:15:11] pymatgen-analysis-defects in the broader open source ecosystem
  • [00:18:15] JOSS review
  • [00:19:12] Contribute to pymatgen-analysis-defects

Transcript:

[00:00:05] Arfon Smith: Welcome to Open Source for Researchers, a podcast showcasing open source software built by and for researchers. My name is Arfon

[00:00:12] Abby Cabunoc Mayes: And I’m Abby.

[00:00:13] Arfon Smith: and we’re your hosts.

Every other week, we interview an author published in the Journal of Open Source Software, JOSS.

This is episode seven, and today we’re chatting with Jimmy Shen about their paper, pymatgen-analysis-defects: a Python package for analyzing point defects in crystalline materials. Jimmy is a postdoc at the Lawrence Livermore National Laboratory, which I’ve been to. I’ve sat outside on the big sign.

I’ve not been in, but I’ve sat on a wall, which was, like, very nice, a nice California day

[00:00:44] Abby Cabunoc Mayes: you got a

[00:00:44] Arfon Smith: years ago. I actually probably can dig out the photo if you’re interested.

[00:00:48] Abby Cabunoc Mayes: In the show notes!

[00:00:49] Arfon Smith: Yes, I learned today that defects are good if you’re a semiconductor researcher. So that’s my "today I learned".

[00:00:57] Abby Cabunoc Mayes: And I, so I don’t think very much about hardware. Computers just work magically for me. So this was a nice way to get into more of the nitty gritty, and it was interesting hearing the kind of research that’s being done to make them even more efficient.

[00:01:09] Arfon Smith: I did. For a minute, I got a little bit lost in my head with the idea that you do semiconductor research with computers to make computers better at doing semiconductor research. So it’s like this weird recursive loop. I think it makes sense.

[00:01:24] Abby Cabunoc Mayes: Yeah,

[00:01:24] Arfon Smith: Yeah, I think it’s fine.

[00:01:25] Abby Cabunoc Mayes: own tail is what Jimmy said.

So.

[00:01:28] Arfon Smith: Yes, exactly. Yeah. And it turns out this is a very rich ecosystem of Python tools, which was cool to learn about.

What else? What else did we learn?

[00:01:37] Abby Cabunoc Mayes: I think, yeah, the other big thing for me was that giant ecosystem of the Materials Project, and just seeing how that ecosystem’s able to thrive and bring in more people. It’s been really interesting.

[00:01:49] Arfon Smith: Yeah, I agree. Side note, I had not noticed that the same group had published so much in JOSS, so there you go, that shows how much attention I’m paying. But that’s fine. Actually, no, in my defense, probably some of that stuff will have gone through the materials science track, and some of it will have gone through the computer science track, so it will have been managed by different editors and editors-in-chief.

But, it’s cool. They’re clearly producing a lot of great software and this was a really nice project to hear about today. So,

[00:02:13] Abby Cabunoc Mayes: Alright, let’s play the interview.

[00:02:14] Arfon Smith: let’s go.

Welcome to the podcast, Jimmy.

[00:02:16] Jimmy Shen: Hi. How you guys doing?

[00:02:17] Abby Cabunoc Mayes: We’re

[00:02:18] Arfon Smith: Good. Great.

[00:02:18] Abby Cabunoc Mayes: Great to have you here.

[00:02:19] Arfon Smith: All right. So Jimmy, tell us about why you started the project. What does this code base do?

[00:02:23] Jimmy Shen: Yeah, so the project itself is an extension of a much bigger project called pymatgen, which we think about as the object-oriented interface to materials science and chemistry, and it’s what’s being used to power the Materials Project, which is a very popular online repository of data and search engine for materials properties. It’s in fact, I think, basically the most popular materials informatics platform on Earth. And a lot of modern-day materials informatics is really focused on automating quantum chemistry workflows and generating and curating a lot of data from these workflows into predictions, scientific insights, and also data for machine learning models to then make more novel predictions. So, this code is really designed to work independently as a defect analysis package for one thing, but it’s also designed to work with a lot of the automation tools that we’ve been building over the last couple of years, including another recent JOSS publication, Jobflow, which is a decorator-based, general high performance computing workflow engine that can help you orchestrate and manage lots and lots of calculations on your high performance computing system.

So it’ll automatically define pieces of work that you need to do in Python, serialize those pieces of work, and put them onto a database of tasks to complete. And then once it completes those tasks, it’ll grab and store the data and put it into its proper format on a combination of MongoDB and AWS S3 storage options. So basically this package is really meant to interface with the surrounding ecosystem of Materials Project packages and make automation of point defects in semiconductor materials a lot more accessible than was previously possible.

[00:04:39] Arfon Smith: That’s really interesting background, Jimmy. Thanks for that setup.

On an earlier episode, we had another person talking about some quantum chemistry packages and some work that they’ve done.

I was curious about something you said: for a given defect, there may be multiple different calculations run, which makes a ton of sense. Are you typically doing those with the pymatgen project as well? Or could you say what packages those are that you’re using, or is that part of pymatgen?

Could you say a little bit about the actual calculation piece and where that happens and who does it?

[00:05:10] Jimmy Shen: So yeah, so basically we try to be very cognizant of the fact that we’re doing a lot of software development, and a lot of the time the tools that we build are not materials science specific at all. So basically every time we identify a whole area of problems that need solving, we try to solve a very particular problem first.

So for workflow automation, there is a package, Jobflow, that is not quantum chemistry aware at all. It has no idea about any of the eventual goals that we’re trying to reach; it’s just there to serialize jobs and put the tasks that you need onto a database. So the fundamental thing for defining pieces of work is a separate package that doesn’t consider materials science.

There are kind of weird interdependencies between all of these packages, but we try to isolate out individual pieces of computer science ideas, like workflow definition and workflow automation, into separate packages, and then basically at the very end, you slap pymatgen, which you can think about as the object-oriented interface to quantum chemistry, on top of whatever set of general Python tools you’ve built, and then you get a new kind of framework for doing something.

Yeah, so in this instance we basically take this package and quite a few other things and combine these with Jobflow. There’s another package that’s going to come up. It’s been out for a while and it’ll be published very soon, called atomate2, which basically combines all of the new materials science development that we’ve done on different parts of pymatgen, and possibly other codes. Combine this with Jobflow and you get a lot of formal definitions of quantum chemistry workflows.

[00:07:09] Abby Cabunoc Mayes: Another reason why I was excited about this: the paper was actually nominated by one of the editors because it was part of the Materials Project ecosystem. It was nice to hear you talk about that a little bit.

I did have a question about defects. So I understand building a database is super important, especially in this AI age. We need clean data that we can use to get really valuable insights. But why would we want to be studying defects? I’m not as familiar with the materials space.

[00:07:36] Jimmy Shen: So, the end, well, maybe not the end goal, but a very, very important part of this is understanding the operating properties of semiconductors. So, a lot of what we do kind of feels like a snake eating itself. So, it’s

[00:07:53] Abby Cabunoc Mayes: Ouroboros.

[00:07:53] Jimmy Shen: we’re, yeah, because we’re using ridiculous amounts of computing power to run these quantum chemistry calculations to predict the properties of semiconductor materials to then eventually make better chips to then eventually do more quantum chemistry calculations. So, so,

[00:08:13] Abby Cabunoc Mayes: fun, okay.

[00:08:15] Jimmy Shen: So, semiconductors by themselves are actually pretty boring. But when they have dopants and defects in them, that’s when all of their interesting properties come out. And for the entire semiconductor industry, a lot of their control and manipulation of silicon up until this point is through manipulating the defects and dopants.

So a lot of the important properties of high quality semiconductor materials are 90 percent determined by the defects.

[00:08:50] Arfon Smith: So it’s a defect because it’s not a sort of perfect, uniform molecular structure or something, but it’s actually a designed defect. Defects are good

[00:08:58] Jimmy Shen: Yeah, so

[00:08:59] Arfon Smith: are better than others. Presumably, is that the right way to think about it?

[00:09:02] Jimmy Shen: Yeah, so the real way to think about this is, imagine you have silicon. There are four electrons for every silicon. You know, doing atomic orbitals in high school, right? Some of them are filled, some of them are unfilled, and then when you squeeze all the atoms together, all of these orbitals interact with each other, but they still look kind of the same, right?

You have filled orbitals and you have unfilled orbitals. That’s kind of what’s happening in a pristine crystal. And if you have silicon and then you introduce something else, something that usually has one extra electron compared to the silicon, you introduce that into the lattice, then that extra electron would tend to go into one of the previously unfilled orbitals, and there’s a gap between these orbitals.

So now it’s kind of stranded by itself up there. So the important difference is that when you’re completely filled or completely empty, the electrons aren’t able to move. It’s really when you have this little bit of space, either when you remove an electron from these filled bands or you introduce an electron into the unfilled bands, that they’re really free to move, because they have nothing else to interact with, in a very hand-wavy argument. So all of the electrons that are actually being carried throughout your devices, they come from defects and dopants. If you had completely pristine silicon, without pumping a huge amount of electricity into it, it’s not gonna have any mobile electrons.

So yeah, that’s where the general interest in defects comes from, right? And then a lot of materials development in the semiconductor industry is really focused on understanding and manipulating the charge carriers from these defects. So basically, yeah, having a database of these basic properties would be really, really useful in that area.
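As a small illustration of the substitution and vacancy ideas Jimmy describes, here is a sketch using plain pymatgen structure objects; it is not the pymatgen-analysis-defects API itself, just the kind of underlying objects that a defect package builds on.

```python
from pymatgen.core import Lattice, Structure

# Conventional diamond-cubic silicon cell (lattice constant about 5.43 angstroms).
silicon = Structure.from_spacegroup("Fd-3m", Lattice.cubic(5.43), ["Si"], [[0, 0, 0]])

# Substitutional dopant: phosphorus carries one more valence electron than
# silicon, so replacing a single Si site donates an electron to the lattice.
doped = silicon.copy()
doped.replace(0, "P")

# Vacancy: removing a site is the other classic point defect.
vacancy = silicon.copy()
vacancy.remove_sites([0])

print(doped.composition, vacancy.composition)
```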

[00:10:57] Abby Cabunoc Mayes: Okay, yeah, that makes a ton of sense. Thank you for that throwback to high school.

[00:11:00] Arfon Smith: I was going to say, yeah, I’m drawing atomic shells and valences in my head as you were speaking. So I’m seeing it. So you talk about this sort of database of different defects and their properties, like who’s the target audience, I guess, for the software, but maybe also for the outputs of the software. Could you just talk about what audiences are you serving with this work?

[00:11:18] Jimmy Shen: Yeah, sure. So, the final database, whatever form it takes, we’re still in the process of figuring all of that out, because these are some of the most expensive workflows that we’ve developed in terms of computational cost. So how much of this we’re going to run and where it’ll eventually get hosted, we’re still not sure. But we can use examples from previous types of work, so if we think about cathode, like battery cathode, materials. That’s something that gets hosted on the Materials Project. And that really is being used by a lot of materials science researchers across the world. So ideally there will be one component of this which is just a basic search engine type situation, where you come in and you filter for properties and quickly access some information that you need, right?

And then there’ll be another part of this where you programmatically access a lot of these properties, and then try to do computational research on top of them. Whether that’s AI focused, building more advanced machine learning models for these predictions, or whether you’re just interested in tweaking these because you might be doing research on a particular material. You might want to tweak these calculations to fit your own needs, right?

So those are the kind of two end target audiences for the data. And the target audience for the code is basically everyone who’s trying to do the same type of stuff that I am. So when I was doing my PhD, you could get a decent chunk of your PhD done by doing a handful of these calculations, and now we’re in the process of scaling this up to tens of thousands or hundreds of thousands, as money permits.

[00:13:17] Arfon Smith: Nice. And it also makes sense why LBL is involved, right? I mean, a place where there’s lots of computers, and computationally intensive research, is often something in the U.S. that the national labs are very involved in, right?

[00:13:30] Jimmy Shen: yeah, you can’t really do this at like normal university scale.

[00:13:35] Abby Cabunoc Mayes: I did have a question about alternatives. Why would a researcher choose this piece of software rather than something else?

[00:13:41] Jimmy Shen: So I think the main reason for choosing this is that it’s completely integrated with all of the other tools. So there’s actually been work within our group that has tried to tie these defect simulations to the automation workflow engines that we have, right?

And that proved to be extremely difficult, and it ended up not coming together due to various technical reasons. But over the last six or seven years all of those technical hurdles came down, right? We gradually got rid of hurdles related to object storage and hurdles related to manipulating volumetric data. And our general workflow engine, which is this Jobflow thing that was recently published, got a lot better. It went from you basically having to learn a new system to design all of your workflows with, to the new workflow engine really just being you writing general Python code, with some caveats. You put the right decorators on things and then everything just ties together automagically. So it’s a lot easier, especially when you’re writing really complex workflows.

Things that were essentially impossible with the old stuff become almost trivial in the new way of doing things. So when those technical hurdles came down, we were like, okay, yeah, there are enough tools here. We can actually build this.
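For a flavor of the decorator-based style Jimmy describes, here is a minimal Jobflow sketch; the job bodies are hypothetical stand-ins for real quantum chemistry calculations, and the point is only how decorated functions and output references wire together into a flow.

```python
from jobflow import Flow, job
from jobflow.managers.local import run_locally

# Hypothetical steps standing in for real quantum chemistry calculations.
@job
def relax_structure(material_id):
    return {"material_id": material_id, "energy": -5.42}

@job
def defect_formation_energy(relax_result, reference_energy):
    return relax_result["energy"] - reference_energy

# Referencing .output wires the jobs into a dependency graph; Jobflow
# serializes the work and hands it to a runner (local here, HPC in practice).
relaxed = relax_structure("mp-149")
analysis = defect_formation_energy(relaxed.output, reference_energy=-5.0)
flow = Flow([relaxed, analysis])

responses = run_locally(flow)
```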

[00:15:11] Arfon Smith: I’m just going to ask about the sort of broader ecosystem. I mean, it sounds like you’re working on a really rich set of packages and capabilities in the Materials Project, but how does this project fit into the broader open source ecosystem? Like, what are the important dependencies that you have that you aren’t working on yourself?

Like, obviously Python’s a big dependency for you all, but what are the things out there in the ecosystem that you sort of take as a hard dependency for the project? Can you say a bit about that?

[00:15:41] Jimmy Shen: Yeah. So one of the packages here is called Monty. This was developed and named, I think, around 2011. So the fact that they called it Monty, it kind of gives you an idea of like how crucial they think a lot of the capability is.

[00:15:59] Arfon Smith: Yeah,

[00:15:59] Jimmy Shen: basically like that’s just a extension toolkit to Python that we use at every step of the process. So that’s a pretty hard dependency, and that’s just really like a lot of tools for serializing and deserializing data, and processing data in a way that’s commensurate with our very MongoDB and very JSON focused data workflow. So I would say like that’s the sort of root of this kind of very, very big tree. And then on top of that, you have pymatgen, which is the base package that does a lot of the object definitions that a lot of these later packages that are focused on quantum chemistry rely on. And then from that, I think there’s a lot of different things that have branched out, some of them affiliated with Materials Projects, some of them not. There’s a lot of action happening in this space. And basically pymatgen and Monty are kind of at the root of this branching tree of packages.

[00:17:08] Abby Cabunoc Mayes: Nice. Are you using MongoDB to build this database of defects?

[00:17:12] Jimmy Shen: Yeah, so.

[00:17:14] Abby Cabunoc Mayes: like MongoDB.

[00:17:15] Jimmy Shen: Yeah, actually we were always on MongoDB, but one of the things that we realized was a barrier for a lot of the work a few years ago was the fact that MongoDB alone wasn’t enough. So we now natively support this chimera storage scheme where everything looks and feels like MongoDB at the top level, but lower down, large objects are automatically serialized into AWS storage. They’re automatically put there, but then at the top level, from a query, it behaves as if it’s still a MongoDB object. We want to make sure the scientific researchers aren’t spending too much time worrying about the flow of data and everything, so that way, if you set up this one config file properly, you should be able to just perform queries and have a combination of MongoDB and AWS S3 take care of everything on the back end.
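From the researcher’s side, "just perform queries" looks roughly like a standard MongoDB query. The sketch below uses plain pymongo with a hypothetical connection string and field names; the actual MongoDB-plus-S3 store described above is handled by the Materials Project’s own tooling rather than shown here.

```python
from pymongo import MongoClient

# Hypothetical connection string, database, collection, and field names.
client = MongoClient("mongodb://localhost:27017")
tasks = client["defects_db"]["tasks"]

# From the researcher's point of view it is "just a query"; the combined
# MongoDB + S3 store described above keeps large blobs out of sight.
for doc in tasks.find({"defect_type": "vacancy", "charge": -1}).limit(5):
    print(doc["material_id"], doc.get("formation_energy"))
```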

[00:18:15] Arfon Smith: For sure. I wanted to switch tracks a little bit and just talk about JOSS for a second and ask firstly, why did you decide to publish there and how did it go?

[00:18:23] Jimmy Shen: Yeah, so basically I heard really, really good things from my friend about the review process for JOSS. It’s open, so that’s really a breath of fresh air for junior researchers, because it feels really nice to just have that process be out in the open.

And the outcome of the process is, you know, I think a lot of us care a lot about the overall quality of our code. So everyone that I’ve talked to has said that they were able to improve their code quality significantly just as part of this process, because it’s really hard to get another person to look at a repo that carefully, usually, right? So that was just an opportunity that I didn’t want to pass up.

[00:19:12] Arfon Smith: Nice. Great. I’m glad to hear that worked out well for you.

[00:19:15] Abby Cabunoc Mayes: Yeah. And back to the software itself: if other people want to contribute to pymatgen-analysis-defects, how can they help you? What skills do they need?

[00:19:23] Jimmy Shen: Yeah, so I think anyone that’s interested in defect science, and anyone that’s interested in materials in general, there are always things that they can do to help, right? There are tools within here that are not actually defect specific. Like, one of the main workflows that we use to discover new cathode materials utilizes tools that are built within this package, right?

So, that was something that existed previously. And then we kind of realized conceptually that this was the best place to put it. So yeah, anything in the pymatgen ecosystem could always use more help.

And then if it’s related to point defects in any kind of material.

So if you care about an isolated atom in a crystal for any reason, I think a good place to contribute some of that code is within this namespace package itself. Right, but the more important thing is just that people invest in the pymatgen ecosystem in the first place.

[00:20:27] Abby Cabunoc Mayes: Yeah, and I think especially with the broader Materials Project, I think that’s a good call to action.

[00:20:32] Jimmy Shen: Yeah.

[00:20:32] Arfon Smith: Jimmy it’s been really fun to hear about your work today and the whole collaboration, the whole ecosystem of tools you’re actively developing in. How can people keep up to date with your work either in open source or academically more generally, do you have places you would want people to to follow you?

[00:20:50] Jimmy Shen: Yeah. So I have a Google Scholar. I have a Google Scholar and I have a GitHub. I think if you’re a researcher, that’s kind of what you live out of.

[00:21:00] Arfon Smith: Yep. That sounds like a good combo. Yep.

[00:21:03] Abby Cabunoc Mayes: Yeah, well, we’ll put both of those in the show notes.

[00:21:05] Jimmy Shen: Yeah.

[00:21:05] Arfon Smith: Absolutely. Again, thank you so much for your time today. It’s been really fun to learn about your work, and thanks again for sharing pymatgen-analysis-defects with us. It’s been a really, really nice conversation.

Thanks. Thank you again for your time.

[00:21:18] Jimmy Shen: Alright, thank you guys.

[00:21:19] Abby Cabunoc Mayes: Thank you so much for listening to Open Source for Researchers. We showcase open-source software built by and for researchers. You can hear more by subscribing in your favorite podcast app.

The Journal of Open Source Software is a community-run journal relying on volunteer effort. If you’d like to support JOSS, please consider making a small donation towards running costs at numfocus.org/donate-to-joss. That’s N U M F O C U S.org/donate-to-J O S S.

Open Source for Researchers is produced and hosted by Arfon Smith and me, Abby Cabunoc Mayes. Edited by Abby and music CC-BY Boxcat Games.

JOSSCast #6: Streamlining Molecular Dynamics – Marjan Albooyeh and Chris Jones on FlowerMD

Subscribe Now: Apple, Spotify, YouTube, RSS

Marjan Albooyeh and Chris Jones chat with Arfon and Abby about their experience building FlowerMD, an open-source library of recipes for molecular dynamics workflows.

Marjan and Chris are both grad students in Dr. Jankowski’s lab at Boise State University where they use molecular dynamics to study materials for aerospace applications and organic solar cells. Before joining the lab, Marjan was a machine learning researcher. Chris learned to code when the lab’s glove-box went down and he never looked back.

You can follow them both on GitHub: Marjan @marjanalbooyeh and Chris @chrisjonesBSU.

Episode Highlights:

  • [00:00:12] Introduction to FlowerMD and its creators, Marjan Albooyeh and Chris Jones.
  • [00:02:54] Explanation of molecular dynamics simulation process and applications.
  • [00:05:10] Insights into FlowerMD’s development process and design goals.
  • [00:09:26] Target audience for FlowerMD and its usability for researchers.
  • [00:11:30] Importance of reproducibility in research facilitated by FlowerMD.
  • [00:16:18] Challenges faced building FlowerMD.
  • [00:19:52] Experiences with the JOSS review process.
  • [00:22:39] Contribute to FlowerMD!
  • [00:24:08] Future plans for FlowerMD and building a research community.

Transcript:

[00:00:05] Abby Cabunoc Mayes: Welcome to Open Source for Researchers, a podcast showcasing open source software built by researchers for researchers. My name’s Abby.

[00:00:11] Arfon Smith: And I’m Arfon.

[00:00:12] Abby Cabunoc Mayes: And we’re your hosts. Every other week we interview an author, and in this case two authors, published in the Journal of Open Source Software, or JOSS.

This is episode six, and today we’re chatting with Marjan Albooyeh and Chris Jones about their paper FlowerMD: flexible library of organic workflows and extensible recipes for molecular dynamics.

So it was great to hear about their work.

[00:00:33] Arfon Smith: It was.

[00:00:35] Arfon Smith: It’s a really tidy submission, had a really fast review, and I think rightly so. I think it’s a really well put together piece of software.

[00:00:41] Abby Cabunoc Mayes: Yeah, and it’s interesting how it takes a lot of different open source software in this field and then links it all together into FlowerMD, so that people can really focus on the science; they don’t have to think about building this workflow every single time. They really thought about making this useful for others.

[00:00:57] Arfon Smith: Yeah, Marjan talks really nicely about the process they went through to really design the software and get the right sort of abstraction so that it was super usable for their community of users.

So I think that’s our first student submission on JOSSCast, right?

[00:01:12] Abby Cabunoc Mayes: I believe

[00:01:12] Arfon Smith: students have been involved, but I think this is it.

I, personally, learned to write software during my PhD. I don't think it would have been worth publishing in JOSS, bluntly. But honestly, I think it's cool to hear about some grad students producing some nice tools for their research area and publishing in JOSS.

[00:01:32] Abby Cabunoc Mayes: So yeah, excited for you all to listen to the interview.

[00:01:35] Arfon Smith: yeah, let’s jump in.

[00:01:36] Abby Cabunoc Mayes: Marjan and Chris, do you want to give yourselves a brief intro before we dive in?

[00:01:39] Chris Jones: Sure. So, we’re both students in Dr. Eric Jankowski’s lab at Boise State University. We’re in the College of Engineering in the Micron School of Material Science and Engineering. In this lab we use molecular dynamics to study materials for aerospace applications and also looking at materials for organic solar cells.

[00:01:58] Arfon Smith: cool. And what about yourself, Marjan?

[00:02:00] Marjan Albooyeh: Hi, I’m Marjan.

I’m a second year master’s student here at the Computational Materials Engineering Lab at Boise State. My previous background was in computer science. But I decided to continue my academic research in material science, so that’s why I joined this lab. And we’ve been doing molecular dynamics research for most of the time combining it with like other tools, like machine learning and analysis.

[00:02:26] Arfon Smith: Very cool. Well, welcome to the podcast. I guess I wanted to kick us off by just checking that everyone knows what molecular dynamics is, because I think I know and I could guess an answer, but I'd love to hear you both just explain a little bit of background about what molecular dynamics is, when researchers might use it, and when maybe you wouldn't use it.

Like, what’s the elevator pitch for molecular dynamics and why it is a thing that’s used, for example, in material science?

[00:02:54] Chris Jones: Okay. So, there's lots of different simulation methods that scientists and engineers are interested in, and they all span an array of time and length scales. And with molecular dynamics, we're kind of down in the nanosecond to microsecond time scales and the nanometer to micrometer length scales.

And with MD, we're simulating particles, so atoms and molecules and the bonds between the atoms. And the main thing here is that we're just using classical physics, so we're not making any considerations for quantum mechanics or anything like that. So we just use Newton's equations of motion, and we have all these forces and interactions between the particles, and we can calculate the positions of particles and how that evolves through time.
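To make that concrete, here is a toy sketch of what integrating Newton's equations of motion looks like in code: a velocity Verlet step with pairwise Lennard-Jones forces for a couple of particles. This is purely illustrative Python, not how HOOMD-blue or FlowerMD implement it, and all parameter values are made up.

```python
import numpy as np

def lj_forces(pos, eps=1.0, sigma=1.0):
    """Pairwise Lennard-Jones forces for a handful of particles (toy example)."""
    forces = np.zeros_like(pos)
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            rij = pos[i] - pos[j]
            r2 = np.dot(rij, rij)
            inv_r6 = (sigma**2 / r2) ** 3
            f = 24 * eps * (2 * inv_r6**2 - inv_r6) / r2 * rij
            forces[i] += f
            forces[j] -= f
    return forces

def velocity_verlet(pos, vel, masses, dt, steps):
    """Evolve positions and velocities with Newton's equations (velocity Verlet)."""
    f = lj_forces(pos)
    for _ in range(steps):
        vel += 0.5 * dt * f / masses[:, None]   # half-kick with current forces
        pos += dt * vel                          # drift to new positions
        f = lj_forces(pos)                       # recompute forces there
        vel += 0.5 * dt * f / masses[:, None]   # second half-kick
    return pos, vel

# Two particles near the Lennard-Jones minimum, unit masses, reduced units
positions = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])
velocities = np.zeros_like(positions)
masses = np.ones(2)
positions, velocities = velocity_verlet(positions, velocities, masses, dt=0.005, steps=1000)
print(positions)
```

Production MD engines run the same loop with neighbor lists, thermostats, and GPU kernels, which is the machinery FlowerMD delegates to HOOMD-blue.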

[00:03:35] Marjan Albooyeh: Yeah, so I think in general, molecular dynamics is a great tool to see the evolution of the particles in a box over time, to see how the particles move and what kind of shape or structure they end up getting after you start your simulation. And the cool thing about the molecular dynamics engines that are out there is you can tweak parameters.

You can change parameters related to your simulation and see how those changes actually affect the final output of your trajectory.

[00:04:05] Arfon Smith: And so if I understand it correctly, you wouldn't use molecular dynamics to simulate a single molecule. That would be more like a quantum chemistry package, like density functional theory or some other method where you care about electrons and all the quantum states.

So it’s not that, it’s a bigger systems than that. Is that fair?

[00:04:26] Chris Jones: Yeah, that's right. And then it's smaller than other simulation methods, like finite element analysis, where you might simulate a shape, like an airplane wing or something, and how that interacts with airflow or something like that. So things that large, you cannot use MD for.

[00:04:43] Arfon Smith: Okay. And is this very standard in, like, material science? Is this a thing? Is this the sort of standard at this kind of resolution of detail that allows you to understand the properties of materials? Is that why it gets picked?

[00:04:54] Chris Jones: Yeah, it’s a popular choice. There’s also Monte Carlo, which can study the same levels of length and time scale, but it’s a different algorithm that it follows. I think which one is better than the other kind of depends on the application and the questions you’re trying to answer.

[00:05:08] Arfon Smith: Well, thanks for the background. That’s really interesting.

[00:05:10] Abby Cabunoc Mayes: And back to FlowerMD, this is a super cute name. I believe the acronym was spelled out in the paper title, but can you tell me a little bit about why you started this project, and maybe how you came up with the name? Did you purposely try to spell flower? I always wonder.

[00:05:27] Chris Jones: I guess it’s a backronym. We wanted a nice acronym and flower was cool. So we kind of forced it to fit the features of the software.

[00:05:37] Marjan Albooyeh: We changed the name.

[00:05:38] Chris Jones: We changed the name a couple of times before we settled

[00:05:41] Arfon Smith: Side note, this seems to be a pattern in all areas of research. You’re like, God, that’s really close to this word that would be cool. Okay, how can we change this?

[00:05:51] Chris Jones: Yeah.

[00:05:51] Marjan Albooyeh: Yeah. Right before we started to actually find the name, we were like, we are not going to do acronyms. It’s just like, something that everybody does. Let’s do something like cool. But at the end we were like, okay, maybe this acronym works better because it tells exactly what we want it to tell.

So, yeah.

[00:06:09] Chris Jones: So why we started the project: something we experienced in this lab is that some of the students before me had their own research project they were working on, and they created their own package that kind of puts the different pieces together to allow them to study and answer the questions they wanted to answer. And it was a great package.

And then, when I started, I started with a new project that was looking at similar kinds of things, but trying to answer different questions. What ended up happening is that a bunch of the choices that the previous students made when designing their package were kind of ad-hoc to their project.

So a lot of the things in that package I couldn’t use for my project. So it’s kind of in a spot where I had to start over from scratch to make my own package to work on my project. And so I did that and I was working for a while and then another student joined the lab with another project that was similar, but just a couple of things were different.

And again, I kind of made the same mistake in my package where I just made all these assumptions about the project I was working on and hard coded into this workflow. And so we were faced with the possibility of having to write a third package from scratch, where, again, a lot of the stuff is kind of the same from one package to the other.

So before we did that we just took a step back to think about, okay, how can we write a package that’s doing these types of workflows and experiments that we want to do, but doesn’t make any assumptions about the questions of the project or the goals of the project. So we started over with Flower and I think Flower would work now for any of these three projects we’re working on and others.

So it’s kind of the design goal for Flower. And I think the story that we experienced is something that’s very common in academic research. Especially computational academic research where everyone’s kind of developing their own tools at their desk. A lot of these workflows and tools aren’t being shared.

There's not a lot of standards in place for some of these things. And we just end up with a lot of repeated work, and it's not necessarily

[00:08:08] Abby Cabunoc Mayes: I really love that, just open sourcing a tool that you can use in multiple different ways, but that can also help other researchers use it in multiple different ways, so they can focus on the science and not have to rebuild something over and over.

[00:08:18] Chris Jones: yeah, exactly.

[00:08:19] Arfon Smith: In your domain model, then, for FlowerMD is each of those three scenarios now a recipe? Like, an extensible recipe? Is that what they would be?

[00:08:27] Chris Jones: Yeah, that’s kind of the idea with the recipes. So the design of FlowerMD is we kind of created this base that any project might need. So any project might have to create molecules, any project might have to arrange these molecules in some box to create an initial configuration, and any project’s going to have to run a simulation.

So we created these base classes. Where these things are kind of all separate, isolated parts of the package. But then how you put them all together is the recipe part of it, I guess.

My project is the thermoplastic welding recipe that’s in Flower right now. The previous project I mentioned it wouldn’t necessarily be a recipe, but all the tools are there right now for them to be able to do the work that they had done already.
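As a rough illustration of that base-classes-plus-recipes structure, and not FlowerMD's actual API, a recipe in this style might compose a molecule builder, a system builder, and a simulation wrapper into one multi-step workflow. Every class, method, and parameter name below is hypothetical.

```python
# Hypothetical sketch of the pattern Chris describes: isolated base pieces
# (molecules, a box/system, a simulation) that a "recipe" chains together.
# None of these names are FlowerMD's real API.

class Molecule:
    """Describe a molecule to build (e.g. from a SMILES string) and how many copies."""
    def __init__(self, smiles, count):
        self.smiles = smiles
        self.count = count

class System:
    """Arrange molecules in a box to create an initial configuration."""
    def __init__(self, molecules, density):
        self.molecules = molecules
        self.density = density

class Simulation:
    """Thin wrapper around an MD engine run (FlowerMD forwards to HOOMD-blue)."""
    def __init__(self, system, forcefield):
        self.system = system
        self.forcefield = forcefield

    def run(self, n_steps, kT):
        print(f"Running {n_steps} steps at kT={kT} with {self.forcefield}")

class WeldingRecipe:
    """A recipe: the same base pieces, composed into a specific multi-step workflow."""
    def __init__(self, polymer_smiles, chains):
        polymer = Molecule(polymer_smiles, count=chains)
        self.system = System([polymer], density=1.0)
        self.sim = Simulation(self.system, forcefield="opls")  # placeholder force field name

    def run(self):
        self.sim.run(n_steps=1_000_000, kT=2.0)  # equilibrate the polymer slabs
        self.sim.run(n_steps=500_000, kT=1.0)    # bring the interfaces together and weld

WeldingRecipe("c1ccccc1", chains=20).run()
```

The design payoff is that a new project swaps in different molecules or a different chain of simulation steps without touching the base classes.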

[00:09:14] Arfon Smith: Yeah, that makes sense. So who’s your target audience for this software? Is it people designing materials? Like what’s the sort of typical job titles of people who would, or research backgrounds of people who would be using your software?

[00:09:26] Marjan Albooyeh: So I think any researcher, it could be undergrads, graduate students, or postdocs, who is doing research on molecular dynamics, specifically on organic materials. Basically, anyone who likes to see this evolution of particles in their material can definitely benefit from FlowerMD.

And I think one of our goals was to build it in a way that it's kind of clear and easy to understand for those who are not necessarily very familiar with programming languages or with using software packages. Because under the hood we are using multiple different libraries and packages to do all of these processes, from initiating the structure, to assembling a box, to running the simulation. But each of those has its own learning curve.

But the goal of Flower was to put all of these in one place with clear instructions and documentation. So a researcher who just wants to get the code running and actually run experiments, rather than spending so much time on understanding each of these packages and learning how to work with each of them individually, can just get to the part of running the experiments and analyzing the results rather than spending so much time on building the code.

And I think our target audience also could be people who are passionate about doing reproducible research, meaning that they want to share their work with everyone, everybody else in the community, and make sure that everyone can also reproduce the exact same results that they had in their papers or in their work.

[00:11:01] Arfon Smith: Yeah, I think what I can hear you saying is, and this is one of the messages I used to try, and still try, to promote for research software and investing in good research software: if you do the work to understand those abstractions of the problem, you can actually do science faster and you can repeat it more easily. And the cost per repeat of doing the experiment drops through the floor once you've got a nice set of tooling.

Better research through better software is one of those messages, and I think lots of people who learn what can be done if you invest in good software come to see that value. But you kind of have to do the work to realize the value.

And then, I don’t know, it’s kind of a bit of a chicken and egg problem for some people, I think, but definitely sounds like an example of creating real value for a whole bunch of researchers.

[00:11:47] Abby Cabunoc Mayes: So it does sound like there’s a lot of open source software in the field that you’re leveraging, which is really, really great to see. Are there any alternatives to FlowerMD specifically, that people could be using, and what makes this different?

[00:12:00] Chris Jones: So I guess in response to all the open source software tools that are out there, there's definitely a lot. I think there's a lot of people in this field that care about creating reproducible open source software. For example, there are two packages that we leverage very heavily.

So one’s called MoSDeF or the molecular simulation design framework. And this is a package or

a group of

[00:12:22] Arfon Smith: brilliant name, by the way. Sorry, I was just like, that might be the best named software package ever. Sorry.

[00:12:28] Abby Cabunoc Mayes: Another backronym, but

[00:12:30] Chris Jones: Yep. So they created packages for building molecules, arranging molecules in a system, and applying your force field parameters to it. And then there's another package called HOOMD-blue, which is a simulation engine that runs on GPUs.

So, there’s lots of packages out there that give you bits and pieces of what you need to do, but like Marjan was saying, it’s really challenging to put them all together, create a single workflow that you can use to start doing science. So that’s a big part of what Flower is.

We’re not trying to recreate the tools that are already out there.

There are some alternatives that do a similar kind of thing. For example there’s one called Radonpy where you kind of have the same steps of going from starting with a molecule and running a simulation.

They kind of focus more on calculating properties from their simulations. With Flower, we're trying to create recipes for more complex workflows. For example, welding two chunks of polymers together really requires running multiple simulations: taking the output of one simulation to build a new system, and using that as the input for the next simulation. It's similar with the surface wetting kind of recipe we have in there, where it's multiple steps of build a system, run a simulation, build a system, run a simulation.

[00:13:45] Arfon Smith: Interesting. Yeah. It sounds like there’s a pretty rich ecosystem out there, as you say, and you’re sort of gluing all these different pieces together, which sounds really useful. Side note, I think your field has some of the best named software out there. So I’m just going to now go and like try and learn some more about that.

There’s clearly a high bar for naming software packages in your area which I appreciate. I was wondering if I could ask, what are some of the challenges you faced when building Flower? Any problems you encountered that you didn’t expect to?

[00:14:12] Marjan Albooyeh: I think the design phase of our project was probably the most challenging one. I remember we had to iterate over the design of our base modules a couple of times. And the reason for that was we kind of wanted to accommodate all of the different projects that might come in the future.

Not only our projects because we were thinking about, okay, we can build this module for initiating the molecules in this way that works for our projects. But then we had to kind of come up with like an imaginary user who in the future might want to use this in a different way. Maybe they already have parts of their structure built.

But can they, use FlowerMD in the middle step of, like, getting the simulation running? So, yeah, we had to come up with these different scenarios, then, and how can our design kind of, like, accommodate that. And we also wanted to keep it simple, especially the API. We wanted to keep it as simple as possible so people can just look at the parameter names and they just understand what it does and it’s clear enough for them, they don’t have to go through the code and understand what each parameter does.

So it was a trade-off of being flexible enough to serve different projects but at the same time being simple enough for people who might not have a strong background in computer science or programming. So I think that was the challenging phase of our project. We spent quite a lot of time just designing the main modules without even coding, just talking about it, but then we coded it and then we said, oh no, this is not good enough, we should do it again.

Yeah, so after that phase, I think the rest of the challenges were making sure that we are following best software development practices. We added unit tests and integration tests, making sure that all of our functionalities also work in bulk simulations, not just the toy models that we were working with.

Yeah, those are mostly the challenges that I think every software engineer might face during the production.

[00:16:18] Arfon Smith: Yeah, it sounds like you have a classic problem there, which is the balance between usability and sort of feature completeness. And finding those right abstractions is often a big part of the discovery process building tools.

[00:16:30] Abby Cabunoc Mayes: So how can people get started using your package?

[00:16:33] Marjan Albooyeh: So we made sure that our package is installable through conda. So if people are familiar with conda, they can just get our software from conda-forge and start working with it. We tried to make it as seamless as possible, so you don't have to install a lot of requirements; just FlowerMD should be fine.

And we also added a couple of tutorials in the form of Jupyter Notebooks to our repository, basically going through the basic functionalities of FlowerMD line by line, with some text description of how it works and what each of the classes in the FlowerMD package can do.

And the tutorials that we made, we ordered them in a way that starts with basic functionalities, so we don't scare people off with the complexity, and then slowly adds more complex features and how they can use these building blocks that we introduced to start their own projects or run the simulations that they want.

And we also made sure that we have good documentation. We wrote docstrings for all of our functions and everything is available in our repository. And we are very happy to answer any questions. If anybody wants to start using this project, they can just open an issue on GitHub and we would be happy to help them get started.

[00:17:54] Abby Cabunoc Mayes: I love that. Yeah, it sounds like you put a lot of work into making it really easy for people to use this, so thank you.

[00:17:59] Arfon Smith: Yeah, I was actually just looking at the JOSS review, and it looks like you, your review was in, like, two months from submission to being accepted, which is very fast. Our average is actually about a hundred days but still, two months is quite great. And often those packages that go fast are the ones where people have got a really nice software package where they’ve basically done everything that we would already want you to do, so people just whiz through the review.

I was curious if you have anything to say about the JOSS experience? Did you know upfront you were thinking about submitting to JOSS, curious how you learned about the journal, why you submitted there, anything you’d like to say about that experience would be fun to hear about.

[00:18:39] Chris Jones: Yes, we decided to submit to JOSS because we wanted to write a paper where we talked about the problems that we were trying to solve and just focus on the features and the software. We weren’t trying to publish results of these simulations or anything like that.

So we just wanted to get this, the software design out there and talk about our motivation for designing the software package and what problems it was solving.

So I think JOSS is a great place for that, where you can just focus on the software, not necessarily the tasks that it’s designed to do and start talking about results and things like that.

[00:19:11] Marjan Albooyeh: Yeah. And I think we kind of love the JOSS review process as well. It was kind of seamless and easy.

Everything was clear. Everything was happening through GitHub, so we could actually communicate with our reviewers, unlike other journals where you never have a channel to talk to reviewers and everything is so formal. But in JOSS the great thing was we could actually talk to the reviewers, they could tell us whether they were able to run the code without any problems, and then we'd go through the next step of the review.

So that’s, I think, what made the JOSS review process kind of easy for us, and I hope for the reviewers as well.

[00:19:52] Arfon Smith: Yeah. Our tagline is a developer friendly journal. So hopefully that resonated for you. It sounds like it. It was a good fit.

[00:19:59] Marjan Albooyeh: Yeah.

[00:19:59] Abby Cabunoc Mayes: And what did you learn by going through the JOSS review process?

[00:20:03] Marjan Albooyeh: Yeah, the JOSS review process was quite easy for us. Because we knew that the reviewers were going to run our code, we had to make sure we had tutorials and running examples before we submitted. So those were the requirements that we put on ourselves, and it was a good way to test our software with reviewers who hadn't necessarily worked in this field.

And the interesting thing that happened during the review process was, by the time of the submission we only had one recipe in FlowerMD, and that was the polymer welding example. But in the paper, we mentioned that FlowerMD could be used for other processes, like surface wetting, where you actually put a droplet on the surface and you see how it expands.

And one of the reviewers suggested that, oh, it would be nice to have this in your package, as you mentioned in the paper. And we were like, maybe we should try it. And because of the design choices that we made, we were able to build this surface wetting module during the review process relatively fast; it just took us a couple of days with testing and everything.

And we were able to merge it before the review process finished. And I think that was a good example of this sort of review process, where the reviewer asked us for something and we said, okay, let's do it. And we did it. And they were able to see the results before accepting.

[00:21:30] Arfon Smith: Super cool.

[00:21:30] Abby Cabunoc Mayes: That’s amazing.

[00:21:31] Arfon Smith: Nice story. So how can others join in and contribute to this package? Is it writing new recipes, extending core functionality? What kind of contributions are you looking for, for Flower?

[00:21:42] Chris Jones: I think either one of those would be great. I think the main thing right now is just if there are new researchers or experienced researchers out there that might be starting a new project the best way to help is just to try to come use FlowerMD and see if it works for your use case. If there are any issues you run into, open an issue on the GitHub page.

But yeah, specifically, FlowerMD is far from what we envisioned as a finished package. Cause like the name suggests, it's a library of recipes, and right now we only have two. So yeah, if someone has some project, I would definitely encourage them to at least, maybe at first, just come give Flower a try, see if you can leverage the design and modules that we have in there in these base classes that we use, and see if you can design your workflow from there.

And then, if it works out okay, then make a PR to contribute your workflow to the library.

[00:22:33] Abby Cabunoc Mayes: And just a real quick follow up, what skills does someone need to know to create a recipe, or what languages would they have to know?

[00:22:39] Chris Jones: So Flower is completely written in Python. The simulation engine we use HOOMD-blue is written in C++ and runs on GPUs and has a very nice Python API. That’s one of the reasons we decided to use it as the engine in our package. So you don’t have to know C++ or CUDA or anything like that.

It’s just purely Python. And we think we designed the base classes in a way that it does not really require any like advanced coding skills to start writing a recipe and make a PR for one.

[00:23:09] Marjan Albooyeh: Yeah, and we would be happy to help through the process if someone wants to contribute, they can open the PR, and we’re definitely going to review it and give them feedback or help them with building the code.

[00:23:20] Abby Cabunoc Mayes: That’s great. So what’s the future of Flower? What’s next?

[00:23:24] Marjan Albooyeh: So the future for FlowerMD, I think, as Chris mentioned, it would be really nice to see other researchers using it and start adding new functionalities to it or adding their own recipes. That’s how, I guess, the Flower blooms. So, yeah, I think that would be the best use case for FlowerMD in the future.

And the other thing is, I think we both would love to see it as a platform that people who are working on similar projects can use and share their results, so it would be easier to get the same results with FlowerMD, getting reproducible results. So we are hoping to gather a little community of researchers who are working in this field with FlowerMD.

[00:24:08] Arfon Smith: Yeah, it does seem like there’s a great opportunity to build community around this software, especially the way that you’ve designed it for that extensibility. So that’s really nice. And I wish you good luck with that. And so, just to wrap us up, like, how can people find you both online? What’s the best way to keep up with your work?

[00:24:25] Chris Jones: So for me it'd probably be GitHub. I'm not active on Twitter or anything like that. So yeah, my GitHub username is @chrisjonesBSU. I'm sure the link for Flower will be in the show notes; it's cmelab/FlowerMD on GitHub. Yeah, so, people can feel free to open issues on there if they wanna discuss it or reach out to me via email or something like that.

[00:24:45] Marjan Albooyeh: Yeah, I think they can check out the FlowerMD GitHub page. We are releasing new versions every couple of weeks or a few months. I think that's the best way to keep up with the package.

And you can also follow my work through my GitHub page. Yeah, it’s my first name and my last name. And you can also check out other projects in the CME lab that we are working on. We have a couple of other cool packages for the similar use cases that people can check out.

[00:25:12] Arfon Smith: Awesome. That sounds easy. And I’m sure people will do that. So thank you so much for your time today. It’s been really fun to meet you both and learn about the work you’re doing here. And I just wanted to say thanks for taking time to be part of the JOSSCast and telling us about FlowerMD.

[00:25:28] Chris Jones: Yeah, of course. Thanks for having us. Thank you.

[00:25:32] Abby Cabunoc Mayes: Thank you so much for listening to Open Source for Researchers. We showcase open-source software built by researchers for researchers. You can hear more by subscribing in your favorite podcast app.

The Journal of Open Source Software is a community-run journal relying on volunteer effort. If you’d like to support JOSS, please consider making a small donation to support running costs at numfocus.org/donate-to-joss that’s N U M F O C U S .org/donate-to-joss.

Open Source for Researchers is produced and hosted by Arfon Smith and me, Abby Cabunoc Mayes, edited by Abby and music CC-BY Boxcat Games.

[/expand]


JOSSCast #5: Rewrite in Rust – Gui Castelão and Luiz Irber on Gibbs SeaWater Oceanographic Toolbox in Rust

Subscribe Now: Apple, Spotify, YouTube, RSS

Gui Castelão and Luiz Irber join Arfon and Abby to discuss their work implementing the Gibbs Sea Water Oceanographic Toolbox of TEOS-10 in Rust, and the role of instrument builders in science.

Gui is an oceanographer at the Scripps Institution of Oceanography, previously part of the Instrument Developing Group. Luiz is a Computer Science PhD at UC Davis and an avid Rustacean.

Gui's website: www.castelao.net. Luiz's website: LuizIrber.org

Episode Highlights:

  • [00:01:55] - Introductions to Gui and Luiz and how they got into open source
  • [00:09:18] - GSW-rs – a Rust implementation of the Gibbs SeaWater Oceanographic Toolbox
  • [00:17:13] - Why Rust?
  • [00:19:53] - Instrumentation culture in oceanography (and astronomy!)
  • [00:23:17] - Recognition and credit for software contributions in academia
  • [00:27:47] - Publishing in JOSS
  • [00:31:36] - Contributing to GSW-rs

Links Mentioned:

Transcript:

[00:00:05] Arfon Smith: Welcome to Open Source for Researchers, a podcast showcasing open source software built by researchers for researchers. My name is Arfon.

[00:00:12] Abby Cabunoc Mayes: And I’m Abby

[00:00:13] Arfon Smith: And we’re your hosts. Every other week we interview an author, or authors today, published in the Journal of Open Source Software. Also known as JOSS.

[00:00:21] Abby Cabunoc Mayes: yeah, This is episode 5. We spoke with Gui and Luiz, and you’ll hear a longer intro about the two of them. But I thought this was a very fun conversation.

[expand]

[00:00:28] Arfon Smith: It was fun. I think we’ve learned as budding podcasters that if you invite two people, you seem to talk even more. So it’s a longer one, but it was fun. We covered a lot of ground.

I think one of the things that I learned and reflected on as I was listening, especially to Gui talk, but also Luiz, was there’s this work that gets done, the instrumentation work, the low level software that’s used to operate these incredibly important instruments.

Oceanography is a pretty impactful research area, especially in these times of climate change and all the things, the reasons we need to care about the ocean. This was a really interesting conversation to have, to learn about the hardware and the software and how that work is divided up in their research areas.

And also just get to hear people who are excited about Rust. Which is kind of fun to listen to people who are passionate about programming languages. What about you?

[00:01:15] Abby Cabunoc Mayes: It was really interesting to hear their story going through this Rust implementation of an existing library. They’re such strong Rust advocates. They were ready to go through it. It was fun also just to hear their journeys through open source and implementing this. And I learned a bit about oceanography. Yeah, let’s play the interview!

[00:01:31] Arfon Smith: Yeah, let’s go for it.

Today we're chatting with Gui Castelao and Luiz Irber about their paper, Gibbs SeaWater Oceanographic Toolbox of TEOS-10 Implemented, I think Importantly, in Rust, although the Importantly bit isn't in the title, but I inserted that just for my own satisfaction.

Gui, Luiz, why don't you introduce yourselves and tell us a little bit about your home institution and whereabouts you're based?

[00:01:54] Gui Castelao: I'm Guilherme Castelao, just Gui. I'm an oceanographer, currently at the Scripps Institution of Oceanography. And before, I was in the Instrument Developing Group, which is related to when this project started. And now I'm working more with machine learning and climate.

[00:02:12] Abby Cabunoc Mayes: Nice and Luiz?

[00:02:13] Luiz Irber: Yeah, so hello everyone, I'm Luiz. I'm a computer science PhD at UC Davis. I work with bioinformatics mostly. I've been doing Python for 20 years, doing Rust since 2017, and for my PhD research, I just love going into public genomic databases and downloading petabytes of data and making them searchable and useful.

I met Gui when we both worked in Brazil at the Brazilian Climate Research Institute. So I have a history of a bit of climate models, so that's my connection with oceanography. Nowadays, it's kind of fuzzy.

[00:02:54] Abby Cabunoc Mayes: Oh, that’s great.

[00:02:55] Gui Castelao: Well, now that Luiz mentioned that, that's interesting. One of the first open source packages that I wrote was an old version of this package in Python. This, back in like 2002, 2003, was one of my first submissions to PyPI.

[00:03:12] Abby Cabunoc Mayes: Full circle! And I think that’s a nice segue. I was gonna ask a fun icebreaker question. Tell us a little bit about your open source journey. So, Gui, it sounds like you have a bit of a background doing open source.

[00:03:22] Gui Castelao: Yes. So, that started around 2002, 2003, mostly with Python at that time, learning how to do it. One thing that's interesting in this whole process is how the community changed, right? At that time it was hard to do open source; like, people didn't know how to use it well. Sometimes people would write to me complaining, like, really pissed: oh, there is an error in your open source that you're making free for us to use. Like it was a product being sold, right? I'm like, wait a minute.

[00:03:53] Abby Cabunoc Mayes: that still happens. But, go

[00:03:56] Gui Castelao: Oh, that is much better now. But I think the community just wasn't prepared. They didn't understand how things worked, right? There was less material for us as well. Like, for me, building stuff, we had to kind of figure out how to do open source by ourselves.

Now there's so much teaching, right? So much guidance on the importance of doing documentation or testing, or how to handle the community, codes of conduct, how you should not behave. It was a big change. It's great. I think we have things that we can improve, but it was great progress.

[00:04:30] Abby Cabunoc Mayes: Nice, and Luiz, how were you introduced to open source?

[00:04:32] Luiz Irber: So in Brazil there was this kind of big festival called FISL, F I S L. It's the international festival of free software. And it happened in the capital of my state in Brazil, Rio Grande do Sul. So, since I was in high school, like in 2000, I would go with some friends to this festival in the capital, and it was like 5,000 people at the largest ones, and we'd just go around. There were booths for the community, like for Linux installations.

And so it was a place that I learned a lot. And then I started tinkering at home with installing Linux in MySphinx and so on. Then in 2005 it was my first time releasing open source software. I started working at another research institute in Brazil that was doing research for agricultural purposes.

I got into it as a research intern, and was kind of rewriting a package that was in Windows, and they promised on the funding that it would work on Linux, but no one had worked on that. And it was a C++ codebase, and I was like, how do I port this thing over to run on Linux? So, that’s when I started using Python and a lot of GStreamer to handle the video components of the system.

I learned Python a bit before, but that was the point where I started really going deep into Python. And then 2007, I think, was the first time I attended the Brazilian PyCon. And that was just amazing. Like, the Brazilian Python community is so nice. And I just learned so much from the community interactions, but also about how to develop software. And the reason I ended up doing my PhD where I did was also because of this Python community connection. I found Titus not because he was a professor, I found him because he was doing a lot of testing in Python.

[00:06:33] Arfon Smith: That's super cool. And this is Titus Brown, of course, who does a lot of open source work himself. I have a side question. I was going to ask you something different, but I'm going to have to ask you, Luiz, because I think your number's going to be higher than mine. How many distinct versions of Linux have you installed in your life?

Like, different flavors of Linux? Because you’re describing a period where I feel like I was reinstalling my machine every two weeks with a different distro. I was trying them all out. I’m just curious if you had to, pick a number, how high is that number?

[00:07:00] Luiz Irber: I would probably say 15. The first time I nuked the Windows system in my computer and lost all my family data, was with a Brazilian distribution called Conectiva, which was derived from Red Hat. So,

[00:07:15] Arfon Smith: What about yourself, Gui? Do you have a number? Top of mind?

[00:07:18] Gui Castelao: A little bit less than that, maybe like 10, because I would install and reinstall and reinstall. Okay, now I've figured it out, and then next week I just install the same one again. But my first, I think my first distro was Slackware, so it was quite intensive, quite a lot of time.

[00:07:35] Arfon Smith: What about you, Abby? Have you, have you got scar tissue from those experiences?

[00:07:39] Abby Cabunoc Mayes: Yeah, I had that period in my life. I think I’m only at two, though. I don’t think I explored too, too far. I really liked Ubuntu, yeah.

[00:07:46] Arfon Smith: yeah, I remember mine’s probably under 10, but I do remember being very proud of an install that I’ve done. Maybe like you, Gui, where you like install it and then you’re like, Ooh, no, next time it’ll be perfect, so you like start again. And then, I went and talked to some of my computational chemistry friends, and they’re like, Well, if you really want to do it properly, you’ve got to install Gentoo.

And I was like, Oh my god, what is this? And you like compile your own kernel to start with. I was like, This is insane! And I remember spending like four days compiling. Like KDE or something, like whatever. And then I, but I’d set the flags wrong and it just didn’t work. I had to go back to the start.

Anyway, that was during my PhD. So, you know, it turns out I had some spare time during my PhD. Anyway, that was a side note.

[00:08:31] Gui Castelao: I love Gentoo. At the end of my PhD it was everything that I used. Like, you feel like you're taking all the juice from the machine. And

[00:08:39] Arfon Smith: exactly,

[00:08:40] Gui Castelao: it's a mess, it's a mess to install the first time, but then to keep up, as long as you don't accumulate too much, it's just compiling in the background. I didn't feel it that much.

[00:08:49] Arfon Smith: Yeah, that’s cool. That’s awesome. Okay, right, well I will bring us back to topic. Sorry for distracting everyone talking about their favourite Linux flavour from 20 years ago or whatever. But that was actually really fun to hear that. So, I wonder if we could just start, going back to this paper, this submission and this software, if you could start by telling us about TEOS-10.

Can you say a bit more about the sort of origin story of this software and why you started the project?

[00:09:14] Gui Castelao: Let me talk a little bit about GSW and then we'll move to how we started. So, the idea behind GSW, actually, is this: we measure a bunch of things in the ocean, right, and there are some properties, for instance density, that are really important in the ocean to be able to explain and understand why water is going one direction or the other, right? But density is quite difficult to measure directly in the ocean, so it's way more efficient to measure other properties and make the best guess. Usually, all the instruments that we use measure pressure, conductivity, and temperature, and with those we can guess pretty well what the density is. I guess this is the most common application of GSW in oceanography, and what GSW provides. It's important to make clear that what we did was the software part, for Rust.

But there is a whole scientific community behind that which actually proposed the relations. Right? This is very important and this is distinct. We built the software, but we take no credit for the science behind it that proposed those equations and those parameters. What this great large community did was propose the relations.

If I have these three measurements, conductivity, pressure, and temperature, what is the best guess for the density at that place? And our electronics measure those three things. So that said, GSW is a collection of those relations, and one big thing compared to the old version, EOS-80, is that the new one was proposed in such a way that it's consistent.

Before, the regressions were done independently, so they could have inconsistencies, especially in the deeper ocean. So the new one is a fully consistent formulation based on the Gibbs function. Yeah, I think I'll stop here. It's way more than that, but I think it's enough to give a picture.

How we started is a little bit different; I'll let Luiz tell that.
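As a concrete illustration of the relation Gui describes, going from pressure, conductivity, and temperature to a density estimate, here is a short example using the gsw Python package (GSW-Python, which wraps the C implementation). The sensor values and the longitude/latitude below are made up for illustration.

```python
import gsw  # GSW-Python, the TEOS-10 Gibbs SeaWater toolbox

# Typical raw readings from a CTD-style instrument:
C = 42.0    # conductivity, mS/cm
t = 10.0    # in-situ temperature, degrees C
p = 1000.0  # sea pressure, dbar (roughly 1000 m depth)
lon, lat = -120.0, 30.0  # location, needed for the salinity conversion

SP = gsw.SP_from_C(C, t, p)           # Practical Salinity from conductivity
SA = gsw.SA_from_SP(SP, p, lon, lat)  # Absolute Salinity (g/kg)
CT = gsw.CT_from_t(SA, t, p)          # Conservative Temperature
rho = gsw.rho(SA, CT, p)              # in-situ density, kg/m^3
print(f"Estimated density: {rho:.3f} kg/m^3")
```

GSW-rs implements these same TEOS-10 relations in Rust, so the same kind of calculation can also run in firmware or behind other language bindings.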

[00:11:03] Abby Cabunoc Mayes: Well, before we move on, GSW, that's Gibbs SeaWater?

[00:11:07] Gui Castelao: Yes, yes, yes.

[00:11:08] Abby Cabunoc Mayes: I did have a question around how you’re getting these measurements. Is there a ship with sensors out there?

[00:11:12] Gui Castelao: Good question. In the old times, oceanography was initially done with bottles. We just measured first at the surface, and there are special bottles that you could lower and they would close at a certain depth, and then you bring these bottles back. So we would sample water from 100 meters, 500 meters, 800 meters, so you could measure, oh, this is the salinity down there.

There are special thermometers that once they flip, they would lock, so you leave it stabilized there. They flip, they lock, and bring it back. Oh, this was the temperature down there. But this is very inefficient, right? Hours and hours to make a few measures.

Nowadays, we work with electronics. And for the electronics, the ones that work on big ships are in what are called rosettes, and these are like a package of instruments, it's huge, full of instruments, that you lower into the ocean with a big ship. Or there are the instruments that I usually work with, small robots that will dive by themselves.

The principle is the same, and the common measurements are, as I said, pressure, temperature, and conductivity. They record that, you bring it back, and then we make the estimate; that is the connection. So we have the electronics, but there are still errors in the electronics, so we still measure with bottles nowadays that you can bring to the lab and measure with very high precision, and calibrate the electronics.

And one of the ideas of this library at the start was to be able to bring more autonomous decision making to the sensors or the robots. So if you wanted to bring machine learning decisions to the firmware of these robots, we need a more flexible language, right? Like Rust.

It needs to be robust, because you just drop these robots, and maybe they spend months and months by themselves, so we don't want bugs in that, like a segmentation fault in the middle. So, if you want to bring more decision making onto these, we need to build the firmware with a language that has more resources, and then these instruments need to make sense of their sensors.

For instance, what is the density, what is the pressure, where I am, you know, do I want to go up or down, and this would be the idea of, can we create a resource to produce a firmware for those instruments or sensors to make more autonomous decisions?

[00:13:32] Luiz Irber: Yeah, so I think the origin story on this is a bit about how I found out about Rust. During my PhD, and still to this day, I work on a piece of software called sourmash, which is the thing that allows me to do these comparisons on public genomic databases. But a big component of that is that I want to be able to run part of the software in the browser.

And sourmash used to be a Python and C++ code base. And in 2017, the best option was to use Emscripten to bring sourmash to the browser, or to rewrite everything in JavaScript. I tried in JavaScript, didn’t work out because of like how we are dealing with integers. Then I was, okay, if I do Emscripten, I really complicate the build system for sourmash.

And then I was like, oh, there's this other language called Rust that's kind of like C++ but has packaging, has integrated testing in the tooling, documentation, and so on. So I was like, I'll try it out. And I've been doing Rust since, I really like the language, and then I started evangelizing the language around, because that's what you do when you learn Rust apparently. I talked with Gui and then Gui started doing his things in,

[00:14:51] Arfon Smith: is that Rust swag that you’ve got there, Gui?

[00:14:54] Gui Castelao: That’s a crab.

[00:14:55] Abby Cabunoc Mayes: yeah. The Rustacians, yeah. For anyone not watching the video, , , and Luiz both holding up their Rust swag. I do have a very soft spot for Rust.

[00:15:05] Luiz Irber: Yeah. So then what happened is that in September of 2020, we started doing weekly Rust meetings to talk about Rust and like figuring out things together. Or like if we found something interesting, like we would show to each other. And then we started working on small projects to work with the language other than the work that we were doing in our jobs.

[00:15:32] Arfon Smith: It does seem like Rust gathers strong advocates, right? It feels like there’s a lot of Julia advocates out there as well, right? People who adopt new languages. Well, I guess new languages only survive and grow if people get excited by them, right?

[00:15:46] Gui Castelao: Exactly.

[00:15:46] Arfon Smith: Full disclosure, I’ve not written a single line of Rust ever, so maybe I should fix that now. It seems like I’m the odd one out on this podcast.

[00:15:53] Gui Castelao: One comment about the story of how we started. So I'm an oceanographer, right? Full oceanography, like undergrad, masters, PhD, everything. And I've always enjoyed coding, but there are a lot of holes in my education in computer science. On the other side, Luiz is a full computer scientist, which was cool, like, at the time when we were working with the climate model.

He was recently graduated, but he was teaching us a lot of things, and these meetings had a huge impact. Like, I imagine how much the scientific community could benefit from this, which was essentially pair coding, right? But it goes all the way from, oh, how do you set up your vi? How do you manage this?

How do you do that? And it made me so much more efficient. These meetings were way more than Rust and an update on the week and what was happening; there were other things, and there was a lot of learning. And I think that the scientific community could improve a lot if people practiced more of that.

[00:16:47] Arfon Smith: You’re preaching to the converted here, but I love to hear it said, so I’m with you, I’m with you. Yeah, thanks for noting that. Okay, so I wanted to just pivot a little bit to audience here. So, you know, writing software, it sounds like, solving a problem that you faced yourself you know, you understand the problem well, but who’s your sort of target audience?

Is it people like yourself, Gui? How broad is the user base here for this software?

[00:17:13] Gui Castelao: That's a very interesting question. So this library in Rust is not directly impacting, like, the community of Python users. It might be that one day they flip, instead of being based on the C one, to be based on our Rust one, but I don't think this is for the general scientific community.

It can be behind the scenes, right? Like for number crunching, efficient and robust number crunching. But the way I see this is more for developers of sensors or high performance computing, anything like that. That's what Rust brings that is special, right? The efficiency, which means faster, but also demands fewer resources.

So if you're doing firmware for a small sensor or a small robot, that's very helpful, or if you're trying to crunch numbers on a fast computer and trying to do that faster. That's the place. But most of the community nowadays, I think, uses Python. And I don't think we are replacing Python. I don't think we should.

[00:18:08] Arfon Smith: Yeah.

[00:18:08] Gui Castelao: So I think a small niche, but a very important one. And maybe a strong one behind the scenes. We still have a wrapper in Python like we do on the repository. But, yeah, not directly for the large community. What do you think, Luiz?

[00:18:24] Luiz Irber: Well, on the question of the target audience, I mostly work with bioinformatics, so, like, I'm not the target audience for this library. So people that are working with oceans or this sort of measurement are the target audience. But as Gui said, there is a big spread also in that, in which niches you are covering.

Like, are you doing data analysis or do you want to deploy this, like, on firmware to run something?

[00:18:49] Arfon Smith: Yeah, different software for different situations. Hearing you talk about this sort of picking the right tools for the right problem is actually reminding me a little bit of how astronomy works, where you have a very strong culture of instrument builders, people who are building the thing that will attach to the camera that will then go measure some fundamental constant about the universe. And there's all the software that gets built for that, and in the most extreme settings, those instruments go into space, or maybe, like deep ocean work I would imagine in oceanography, you know, the instruments are remote.

And then there’s also this sort of separation of, well, that’s not the right tools for the people who are going to want to use that data and do their analysis. And so then it might be much more like scientific Python stack, notebooks, maybe popular machine learning libraries, that kind of thing.

I have a question in there actually, which is: Does oceanography have a strong instrumentation culture? Like, are there people who just build instruments and think about that? Is that a career path that people follow in oceanography? Or do people usually do multiple different things?

[00:19:53] Gui Castelao: Yes, I think there is. When we started this library I was working in the instrument development group of Scripps, it's

like

[00:20:01] Arfon Smith: There you go. There’s the answer.

[00:20:03] Gui Castelao: just dedicated for that, and some of the standard instruments used today, they started there. there is this path. I agree with you. I agree with everything that you said.

And they are quite different. I think it's quite difficult to do both. There are a few people, a few geniuses, that can walk both ways. But I don't think normal people like me should aspire to that. And I also think one of the issues with that is the way the scientific community recognizes this.

Right. And by the way, I think that's an important job that JOSS is doing. If you dedicate a lot of time to developing software, improving software, like me, I was an oceanographer going towards software development. If you invest a lot of time in this to do a good job, of course you're going to publish less if you're a normal person.

But the scientific community has trouble recognizing that. We are improving; we still have some way to go, but we are much better already. And I think the contribution of JOSS is critical in that. So there is a way for you to produce software that's reproducible, that people can understand how to use, that's tested, and you can have a DOI and recognition.

Look, I wasn't fooling around. I was working, I was doing something. Maybe the PhD didn't come with all these papers, but look at all the things that I built that people are using. Sorry, I think I went a little bit off your question.

[00:21:23] Arfon Smith: But I think you're touching on some important topics. In astronomy, speaking from my experience watching colleagues, the challenges that software engineers face in academia are quite similar to those of the instrument builders, which is they're doing work that can result in papers, but it's often not immediate, and it's a different kind of work.

Like if you build an instrument that a thousand people use over the next ten years, they're probably going to cite you every time. So the long lead is great: you get tons of citations eventually. But if you spend eight years working out how to build that instrument, you might publish like one or two conference papers in that very long period.

And so you don’t look like a normal researcher because you’re making something. And I’m always struck by that sort of similarity between software and hardware. And people especially I think face some of the same problems because we built universities around this idea that you’re going to publish research all the time.

And that’s the only way that we can judge somebody’s output. So

it’s interesting hearing you talk about it. Mm

[00:22:29] Gui Castelao: The recognition of the leading developer could be better, but it's hard to hide that kind of achievement, right? The part that I'm most concerned about is that those things are not built by one person.

There is like a community doing little contributions, right? And these little contributions are the ones that I think disappear. I think we should have a better way to recognize that and give people a chance. The point is not the ego side, right? We leave that aside. What I mean by recognize is give the resources for that person to keep going. Maybe during the PhD they spent one year developing this thing that was an important component of a bigger thing.

We should put some money towards this person: let's go for another year, produce something else, you're doing great, keep moving on. That's what I mean by recognition. I don't know if Luiz wants to add something about this.

[00:23:17] Luiz Irber: Yeah, I will say that we tried, with not sourmash but another piece of software in the lab, to hack this publication system and say every person that submits a pull request is an author on the paper. And that also led to so much discussion, like, oh, but this person just fixed a typo. And it's still software centric in a sense, because there are a lot of people that are also scientists who are not going to program something, but they come with use cases, we discuss a lot about how to implement things, and they don't have a pull request to count as an author. Software credit is complicated.

[00:23:54] Arfon Smith: I agree. You used the exact word. My favorite word in life these days is it’s complicated. And I think this is a good example of that.

[00:24:02] Abby Cabunoc Mayes: Yeah, and I'm editing an episode we recorded but haven't released yet, where we do talk about how documentation is often more important than the software itself for the humans using it. So contributions to documentation are also really important.

[00:24:16] Gui Castelao: I think the DOIs, like, the DOIs are not perfect at all, but I think they address part of this with the Creators category and Contributors, right? So you can still add people that gave some contribution but not at the level to be a Creator or author, right, and you can move things around. I think I like that solution.

Which, by the way, I have a repo called inception which explains how to use the .zenodo file, so you can automate releases if you link with Zenodo.

[00:24:44] Arfon Smith: Oh, nice. It’d be cool to get a link to that project. If you have some examples, we can put it in the notes.

[00:24:48] Abby Cabunoc Mayes: So I know we’ve talked a bit about Rust, and you’re huge fans of Rust, but can you tell me a bit about the advantages of using Rust in this rewrite, and like, why you chose that?

[00:24:57] Luiz Irber: So I'll go ahead and answer that. I think the main benefit of this is that when we started working on implementing this library, we wanted to have something that was maintainable over time, but also as close as possible to the equations that are in the scientific paper. Because those equations are complicated and have a lot of parameters.

So, going with Rust has this benefit that it's a low level language performance-wise, so we could keep a lot of the parameters, and the compiler usually does a very good job of optimizing all of that away, across function calls and so on. But also, because it's a low level language in that sense, it's kind of a drop-in replacement for a C library, for example.

So we can avoid this practice of rewriting everything in Rust, or making all the other projects that use GSW-C have to fix a lot of stuff to be able to support this, and the other implementations of GSW that are usually based on the C version could benefit from our work.

And a comment on GSW: GSW, the reference implementation, is in MATLAB. And there is a version written in C that's used by other languages like Python and R and Julia to have all these features available. So, what we were aiming for is to not just replace GSW-C, but to be an alternative focused on keeping compatibility with the other libraries, while also increasing the correctness of the code and how we test, how we document, how we do everything.

And Rust was a good fit for that.
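To make the drop-in point concrete: a caller that loads a C shared library does not care whether that library was compiled from C or from Rust exposing the same C ABI. A rough Python sketch of the consumer side follows; the library filename is hypothetical and the gsw_sa_from_sp signature is an assumption based on GSW-C's conventions, so treat both as illustrative only.

```python
# Sketch: calling a C-ABI GSW-style function from Python via ctypes. Whether
# the shared library behind it was built from GSW-C or from a Rust crate
# exposing the same C symbols makes no difference to the caller.
import ctypes

# Hypothetical library name; the real file depends on platform and packaging.
lib = ctypes.CDLL("libgswteos-10.so")

# Assumed signature: double gsw_sa_from_sp(double sp, double p, double lon, double lat)
lib.gsw_sa_from_sp.argtypes = [ctypes.c_double] * 4
lib.gsw_sa_from_sp.restype = ctypes.c_double

# Absolute Salinity from Practical Salinity at one point in the ocean.
sa = lib.gsw_sa_from_sp(35.0, 100.0, -45.0, -30.0)
print(sa)
```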

[00:26:43] Arfon Smith: I love it.

[00:26:43] Gui Castelao: Just to reinforce, I agree with everything. For me, two things really stand out here. The first is the low-level side, because of the interest in going to firmware, right? So the same code that runs on my laptop and on HPC, I can use to build firmware. And the second, as Luiz said, is the zero-cost abstraction, that we can write the functions as close to the papers as they are.

So it’s very easy to compare, to check, to validate without compromising the performance because the compiler will then organize the whole thing for efficiency. There are a lot of advancements, but these are the two ones that most attracted me for this.

[00:27:18] Arfon Smith: Awesome. Okay. So just a little bit of a pivot to talk about JOSS for a few minutes. I think your paper was one of the first of 2024 to be published, so congratulations. Well, I guess maybe you would rather it was published as one of the last ones in 2023, I have no idea. I was curious if you could just say a bit more about why you published in JOSS.

You touched on this a little bit earlier, but what were the motivations for that, and how was the review experience? Anything particularly of note for either of you during the review?

[00:27:47] Gui Castelao: So it's not my first paper at JOSS, and I think the same goes for Luiz, right? We've reviewed papers there before as well; we really appreciate JOSS.

There are several reasons to publish with JOSS, and one of them is, well, I'm having trouble deciding which one to say first. One of them, I think, is the culture, right? It's different from a typical review of scientific papers or proposals, where people are just trying to find mistakes, find trouble, find, like, oh yeah, I found a flaw here.

I always felt involved at JOSS. The culture is different. It's like: yes, let's publish this. What is the best that you can make? Let's sit together, and, oh yeah, maybe if you improve this it's stronger, if you do that it's stronger, but always in a positive way, in the sense that we're going to do this. I don't know how long it'll take, but we're going to make it, and we're going to make it together. I think this is fantastic for everyone involved, and for the scientific community as well, right?

Another thing that's very important for me is the efficiency of being centered around developers, right? So instead of investing time in the manuscript, let's invest time in the documentation of the software, which, if the software is successful, is what people are going to actually use, and in testing, to be sure that it's doing what is expected and that people understand.

So for those of us who are developing software, that's very efficient. I wouldn't say it's less work, but at least you feel that the work is all being put in the right place, on things that will be useful for others.

One final thing I mentioned before: I think it's great to create a space to recognize software developers, scientific software developers, right? So now you have a publication, you have a DOI, and especially for early-career people, they can show: look, this is what I did in the last few years, this is what I produced, and it's possible to track the scientific impact, right?

If you go to sourmash and look at all the publications that use it, you can see, like, oh yeah, we are thankful to Luiz for investing time in this, because look at all the science that came out of it, and maybe the funding agencies can recognize this more and more and give more support.

[00:29:52] Luiz Irber: I will say when I started my PhD, when I got review requests together with other people in the lab, we had a checklist that was not as complete as JOSS, but every time I went to do a review of a paper, I would go check the repo and see is the code available? Can I install this? Can I run the software?

And so when JOSS came around, I might have been a bit too excited to jump in and start volunteering to do reviews. So, like, during my PhD I did a lot of JOSS reviews. And that's also the reason why I'm wearing this sweater.

[00:30:26] Arfon Smith: I noticed that. I suddenly saw the top of the logo. I was literally about to ask you.

[00:30:31] Luiz Irber: Heheheh

[00:30:34] Arfon Smith: That's very nice. I promise I didn't send that to you.

[00:30:37] Luiz Irber: Well, eh, you

[00:30:38] Arfon Smith: I probably did, because you’ve reviewed so much, right? Yeah, but a while ago. Yeah, that’s very nice. Thank you for that.

[00:30:45] Luiz Irber: So it was always a focus for me: science has to be reproducible, especially the "easier", with a lot of quotes, part of it, which is the software, something that we can actually run. We could go and redo a long-term experiment in the lab that will take 15 years to complete, but software has a much lower bar for reproducing what is being written and published.

So I really like JOSS because of that. And also, again, as everyone already mentioned, for raising the bar on the usability of software. Because I can't really use a paper as a dependency in whatever I'm building, but I can use software to do more scientific research and figure more things out.

[00:31:28] Abby Cabunoc Mayes: That’s great. And I’d love to hear from you, Luiz, also. What kind of skills do people need to contribute to this software? I see it’s open for contributions.

[00:31:36] Luiz Irber: Yes, it's open for contributions. I will say that reading equations and translating them to Rust is probably the largest part, because one thing we didn't mention is that GSW has a lot of functions that end up being derived from these equations. How many are there? Like, it's more than a hundred for sure.

[00:32:00] Gui Castelao: A hundred and a few, yeah, I counted them.

[00:32:02] Luiz Irber: And a big part of that is that some of the equations are used more than others. So we focused on the subset that GSW-C was implementing. There are some that even GSW-C didn't implement from the MATLAB reference implementation, so we still have some left to finish implementing.

But I would say that we also structure each function we implement so that the function, the documentation, and the tests are close to each other. So if you're coming to contribute, you have a lot of examples of how to help and how to do that translation.

So pretty much: reading equations and translating them to Rust. It's a very C-like Rust; we are not doing too many fancy Rust things in the code base. And then attention to detail and patience, because these equations can be quite long sometimes.
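The library itself is Rust, but the layout Luiz describes, with each function, its documentation, and its test sitting together, is easy to picture in any language. A small Python analogue for illustration, using a deliberately generic placeholder formula rather than a real TEOS-10 equation:

```python
# Sketch of the "function + docs + tests side by side" layout described above.
# The formula here is a simple, well-known placeholder, not a GSW equation.
import math


def speed_of_sound_ideal_gas(gamma: float, r_specific: float, t_kelvin: float) -> float:
    """Speed of sound in an ideal gas.

    Implements c = sqrt(gamma * R_specific * T), written to mirror the
    equation as it would appear in a paper, one symbol per parameter.
    """
    return math.sqrt(gamma * r_specific * t_kelvin)


def test_speed_of_sound_ideal_gas():
    # Air at 293.15 K: gamma ~ 1.4, R_specific ~ 287.05 J/(kg K), so roughly 343 m/s.
    c = speed_of_sound_ideal_gas(1.4, 287.05, 293.15)
    assert abs(c - 343.2) < 1.0
```

Keeping the check right under the translation makes it obvious, for each new equation, what "done" looks like for a contributor.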

[00:32:58] Gui Castelao: Yeah, I’d say a lot of attention and patience. The priority is to do the correct thing.

[00:33:04] Luiz Irber: Yeah, and we also set up the developer infrastructure around it. Like, we run all the tests in CI with GitHub Actions. We also set things up to replace GSW-C in GSW-Python; the R one is not working that great yet. But we run the tests for GSW-Python on top of our implementation to see if we are matching everything.

So we try to take great care of how the functions are implemented, making sure we are getting all the scientific parts right, and then evaluating against whatever we can that uses this and has tests, to make sure it's also correct across languages.

[00:33:47] Arfon Smith: I was just going to say, it sounds like you’re being incredibly diligent with this library, which sounds entirely appropriate for the type of work that you’re doing. Also I love that it sounds like it’s a nice contributor experience that you’ve got set up there with the code, the tests, and the docs all together.

So definitely sounds like this project is open for contribution. I wanted to close this out by just asking you both how people can keep up with your work online, keep up to date with your work. Are there any particular places you would want people to follow you or subscribe to your work? Luiz, do you want to go first?

[00:34:18] Luiz Irber: Oh, I have a website up. It's called LuizIrber.org, and it has my blog and the talks that I give. I've been lowering my consumption of Twitter and the like as much as possible, so I do have Mastodon and BlueSky and Twitter accounts, but I'm not very active on them.

[00:34:37] Arfon Smith: And what about yourself, Gui?

[00:34:38] Gui Castelao: I'm a little bit limited in my online presence, but I have a website, castelao.net, and @castelao is my handle on GitHub. I think that's the best way to track me and see what I'm doing.

But I have one question for you both. How can we contribute to JOSS? Like we already do reviews and we already do stuff, but we really think that this is an important thing.

One of the risks that I see is that this is so great that everybody sees it and jumps in. How do we scale that, and how can we help beyond just reviewing papers?

[00:35:13] Arfon Smith: Well, you're playing a huge part by reviewing already, and Luiz is actually an editor at JOSS as well, so he's doubly helping there. One of the sustainability challenges that we face as a journal is just not burning out reviewers and editors. So I think the most concrete thing I could say is: if you have thoughts on how to make reviewing or editing at JOSS a more sustainable process for the communities that you operate in, that's always very welcome.

Send us great editor suggestions when we do public calls for new editors, because that happens semi-regularly. We just closed one out most recently in December, and we're currently onboarding new editors. But yeah, it's a great question, and thank you for the care and attention. It's appreciated. I agree, JOSS is important, and we want it to thrive and survive for a long time.

[00:36:03] Gui Castelao: Thank you. Thank you both.

[00:36:05] Arfon Smith: Yeah, okay,

[00:36:06] Gui Castelao: And the rest of the team.

[00:36:07] Arfon Smith: Abby, do you wanna wrap this up?

[00:36:09] Abby Cabunoc Mayes: This was a very fun conversation. I enjoyed all the swag, all of the logos. It was lovely chatting with you both, and yeah, best of luck with this work. I'm looking forward to seeing how it grows and how the community grows around it.

[00:36:21] Arfon Smith: Yeah, thanks for your time both.

[00:36:23] Gui Castelao: Thank you.

[00:36:23] Abby Cabunoc Mayes: Thank you so much for listening to Open Source for Researchers. We love to showcase open source software built by researchers for researchers, so you can hear more by subscribing in your favorite podcast app. Open Source for Researchers is produced and hosted by Arfon Smith and me, Abby Cabunoc Mayes, edited by Abby, and the music is CC-BY Boxcat Games.

[/expand]


JOSSCast #4: Applying ML to Quantum Monte Carlo simulations – Nicolas Renaud on QMCTorch

Subscribe Now: Apple, Spotify, YouTube, RSS

Nicolas Renaud joins Arfon and Abby to discuss QMCTorch, a PyTorch implementation of real-space Quantum Monte-Carlo simulations of molecular systems, and work to promote research software as a research output.

Nico is the head of the Natural Sciences and Engineering section of the Netherlands eScience Center and Senior Researcher at the Quantum Application Lab. He focuses on the intersection of material sciences and machine learning.

You can find Nico on GitHub @NicoRenaud or the Research Software Directory

Episode Highlights

  • [00:02:31] Introduction to QMCTorch – recasting Quantum Monte Carlo as a machine learning problem
  • [00:09:30] Hardware requirements – run it on the cluster
  • [00:11:05] Choosing PyTorch for QMCTorch
  • [00:17:40] The Netherlands eScience Center and promoting research software
  • [00:18:47] Publishing QMCTorch in JOSS
  • [00:19:02] QMCTorch is open for contributions!
  • [00:20:51] Future directions for QMCTorch

Transcript

[00:00:05] Abby Cabunoc Mayes: Welcome to Open Source for Researchers! A podcast showcasing open source software built by researchers for researchers. My name is Abby.

[00:00:12] Arfon Smith: And I’m Arfon.

[00:00:12] Abby Cabunoc Mayes: And we’re your hosts. Every other week we interview an author published in the Journal of Open Source Software, or JOSS.

[00:00:18] Arfon Smith: So who did we talk to today, Abby? What are we going to learn about?

[00:00:20] Abby Cabunoc Mayes: We talked to Nico, Nicolas Renaud.

[expand]

[00:00:22] Arfon Smith: Yep.

[00:00:23] Abby Cabunoc Mayes: about his paper, and, you know, honestly, reading the paper, I was a little nervous. I thought this would go a lot over my head, but it was actually a really interesting conversation. Both on the eScience Center in the Netherlands and hearing about the work that they’re doing to promote more open source research software.

But also just a nice case study on why Nico decided to make this open source, how he’s been approaching it and the tools that he’s using to build this.

[00:00:45] Arfon Smith: Yeah, it was a bit of a journey through history for me personally as well. I used to do some computational chemistry in a previous life and so learning how some of that tooling has evolved was interesting for me.

I think the thing that’s kind of interesting about computational chemistry is it’s historically quite a closed source field. A lot of mega packages, big, expensive licensed tools that you can buy. And so this I think is a really great example of people leveraging some of the more popular machine learning tooling, PyTorch, applying it to CompChem problems.

So yeah, it was super interesting. And as you say, the eScience Center seems to be really at the forefront of a lot of the best and most interesting work in the research software engineering space.

[00:01:25] Abby Cabunoc Mayes: Yeah, and I think there’s a lot of open stuff happening in Amsterdam especially. Because I know I used to work at Mozilla, and MozFest moved there

[00:01:33] Arfon Smith: Right.

[00:01:34] Abby Cabunoc Mayes: I’ve never been to Amsterdam, Stroopwafels are my favorite, I still really want to go one day. But

[00:01:43] Arfon Smith: us some or something.

[00:01:44] Abby Cabunoc Mayes: Yeah, yeah, Yeah.

But yeah, let’s dive right into that interview.

[00:01:48] Arfon Smith: Sounds good.

[00:01:49] Abby Cabunoc Mayes: This is episode number four, and today we’re chatting with Nicolas Renaud on his paper, QMCTorch, a PyTorch implementation of real space quantum Monte Carlo simulations of molecular systems.

Welcome, Nicolas. Or Nico, we’ll call you Nico during this.

[00:02:04] Nicolas Renaud: Nick also works here. Yeah, thank you. Thank you for inviting me. Really happy to be here.

[00:02:09] Abby Cabunoc Mayes: Of course, how’s it going in Amsterdam?

[00:02:11] Nicolas Renaud: Snowy, to tell you the truth. We've had snow for the last week and it's not something we're used to, so it's a bit of chaos everywhere. But it's very nice. It also means that it's sunny, which, you know, is not a usual thing in the Netherlands, so I'm pretty happy.

[00:02:23] Abby Cabunoc Mayes: That's good. It's snowy but not sunny here in Toronto.

[00:02:26] Nicolas Renaud: That’s a bad mix.

[00:02:27] Arfon Smith: It is cold. I can report it’s cold in Edinburgh and not snowy and my kids are furious because about 40 miles away it is snowy and they would like that snow here, so there we go.

[00:02:37] Abby Cabunoc Mayes: Ah, too

[00:02:37] Arfon Smith: wintertime in the Northern Hemisphere

[00:02:38] Abby Cabunoc Mayes: but yeah, let’s jump right in. Nico, do you want to tell us a bit about your background? And then maybe also QMCTorch?

[00:02:45] Nicolas Renaud: Yeah, of course. So I come initially from a very academic world, right, so PhD, postdoc, and all the rest in quantum chemistry slash materials science, so that's really my background initially. But for four or five years now I've been working at a place called the Netherlands eScience Center; it's a publicly funded research center in the Netherlands.

It focuses on research software, so developing research software together with partners and also using this research software to actually do some research with them and publish some results. And at the center we don't necessarily focus on a given discipline; we work across all disciplines. So in the past couple of years I've worked in biochemistry, in astronomy, in ecology, so a lot of different topics, but always with the focus on developing the tools that researchers need to do their research.

We’re pretty broad in that sense, and we’re also pretty broad in terms of technology. So we use a lot of different toolings and we try to make the best choice each time.

So the QMCTorch project came from a project in materials science, with collaborators here in the Netherlands at a university called Twente University in the east of the Netherlands, and with someone called Claudia Filippi, who is really one of the experts in quantum Monte Carlo simulation. And together with her, we started by refactoring, to really improve the code that she has. It's called CHAMP. It's a code that has been in the community for a long time and has been used by many people to publish many results.

But at the same time, we decided to develop something new, right? And that's how QMCTorch came about. The idea there was that while doing the project with Claudia, we kind of figured out that quantum Monte Carlo at large could somehow be recast as a machine learning problem. And so we decided to take a machine learning approach to the quantum Monte Carlo problem.

And little by little we built a couple of tools, and the last one was QMCTorch, which we published a couple of months ago. So yeah, that's the genesis of the tool, in a nutshell.

[00:04:29] Arfon Smith: It’s really cool. I’m actually quite familiar with some of the work of the center you’re at. There’s lots of great software engineering going on and research software engineering.

I was curious if you could just backtrack slightly and explain for the audience. I would say this is software that helps understand a chemical system or material; I think you said materials science for the particular application here. What's happening here? Is this quantum chemistry software? Is this computational chemistry? What's the broad umbrella that we're working within here? And why do researchers use these methods?

[00:05:00] Nicolas Renaud: Yeah, so the broad umbrella would be quantum chemistry, so really looking at molecular systems or individual molecules, but through a quantum physics lens. So trying to describe this molecule at the highest degree of accuracy you can think of. And applications are quite broad, right?

So for the projects that we were working on there, the application target was really developing new materials for photovoltaics. So if you have molecules in photovoltaic cells, they're going to absorb light, and you want to know which light, the frequencies they're going to absorb, what's going to happen after that, how they're going to transform this light into electricity.

And of course, by tweaking the molecular structure, you can sort of tweak the overall performance of your solar cell, right? If I take the broader picture, that's really what we were trying to work on for that project. But there are many other applications, whether it's designing better materials because of their strength, or better drugs because of how they bind to different proteins, right? So the target applications are quite broad.

Quantum Monte Carlo itself is a very accurate method, right? It means that it’s also quite expensive and it also means that you can only apply it to very small molecules usually, right? So that’s what you see mostly in quantum Monte Carlo simulation, that you’re gonna, you know, play with molecules that have a few atoms, right?

A dozen atoms. So very small molecules, and you really try to compute the properties of these molecules as well as you can. Researchers are interested in the quantum Monte Carlo approach because you can parallelize it very efficiently, right? So you can really make use of very large infrastructure to compute your molecular properties.

So that’s really one of the unique criteria for Quantum Monte Carlo, but it’s still a very expensive method, right? It’s a sub community of the quantum chemistry community that is really interested in it. Compared to more popular methods that people use on a daily basis.

[00:06:36] Arfon Smith: So very high level, using a computer to simulate molecules rather than other ways. If I had a jug full of a molecule or some kind of liquid, I could do a physical experiment on it in the physical world, or I can simulate it, right?

So I think I understand the answer to this, but I'm just going to ask you to verify: these are computationally expensive because you're trying to keep track of every single electron and proton. You're tracking the interactions, the quantum state of the system, continually, right?

[00:07:04] Nicolas Renaud: Yeah, you're really trying to track all the interactions between all the electrons in your molecule, right? And you can have quite a lot of them if you have heavy atoms. How they interact together, but also how they interact with the nuclei of your molecule. And you do that at the quantum level.

So you’re really trying to define what we call a wave function. So it is very complicated math objects that define where the electrons are, what level of energy they have. But it’s something that becomes very expensive computationally very quickly, right? So that’s why you need quite a large computational resource to run this calculation.

And of course, the goal is to do all this computational work on one side and then go to your colleagues who are in the lab and say, hey, I think that this molecule can do this, what do you think? And try to see if the two match, right? And most of the time they don't, we have to be honest, right?

Because, it’s very hard to compute this accurately. And there are a series of different methods that allows you to reach this level of accuracy that you’re after. And quantum Monte Carlo , if you do it well, if you define this sort of wave function in a proper way, and if you optimize it in a proper way as well, can really lead to very accurate results. But it comes at a price and the price is really a very high computational workload that you have.

[00:08:12] Arfon Smith: So, could you say a bit more about that? It sounds like this is running on a cluster of machines. I don't run this software on my laptop, right? At least not for a big molecule.

[00:08:20] Nicolas Renaud: For big molecules, no. So if we look a little bit more at the software side of things, when you develop things, it's impossible to do the development with tests that have to run on a cluster somewhere. So you have to break it down to something that is more manageable at a local level.

So if you take an H2 molecule, so a molecule with two hydrogen atoms, that you can do on your laptop and you can use this kind of system to actually do your development. But from a scientific point of view, that’s maybe not super interesting, right? So if you want to go to a much larger molecule, then you need to take your software, put that on your favorite supercomputer that you have at your disposal, and then run your simulation there.

If you look at a very large molecule, you're going to take a couple of nodes of these supercomputers, or let's say, I don't know, 200 CPUs, right? And you're going to use them full time for a couple of days, maybe, right? So that's really the scale of the simulation we're talking about.

And quantum chemistry at large is very demanding in terms of computation.

[00:09:11] Arfon Smith: So you always want HPC. You always want more compute. You can always use it. If you’ve got it, you can take it. Sounds like it.

[00:09:18] Nicolas Renaud: We work very closely with the Dutch National Supercomputing Center, which is actually just next door from us. So we do make use of the infrastructure quite intensively. But yeah, without this, you cannot do this kind of work. That's for sure.

[00:09:30] Abby Cabunoc Mayes: Cool, that makes sense. Going back a little bit, you mentioned the solar cells, and I saw that this was developed as part of the A Light in the Dark project. Can you tell me a bit about the outputs or what insights came out of using the software in that experiment?

Yeah.

[00:09:45] Nicolas Renaud: Yeah, a very good question. So the code was really aiming at that application, right? We had a couple of roadblocks, one of them being COVID, and we, you know, all suffered through that, so the project really ran mostly during COVID. And the project was much more geared towards the software side of things, right, and the methodology development, than the application, in a way, right?

So we did a lot of software, as I said, and the software is now on GitHub, openly accessible for everyone, and it's been used by a lot of people. And also, out of this software we managed to create a broader consortium that really aims at pushing quantum Monte Carlo methodology within the materials science community.

So that was one of the main outputs. In terms of scientific outputs, it may be something to edit out, but we didn't have that much, unfortunately, for that project. And partially that's because we focused so much on the method, right, and the PhD students working together with us did quite a lot of work on how to really optimize this electronic structure problem the best way possible.

So really on the mathematical foundations of this and its implementation, much more than on the application, right? So I cannot tell you we found a brand new molecule that is much better than all the other molecules, because in the end we really focused on the software side of things.

[00:10:54] Abby Cabunoc Mayes: Well, it’s great that QMCTorch still came out of that, and you still have things that are useful for the scientific community. So another question I had was why PyTorch? Why are you building this on top of PyTorch?

[00:11:05] Nicolas Renaud: So when I started, that was a bit my first try at machine learning at large; it was really the first time I was putting my hands into it. PyTorch at the time, and probably still now, was a very popular library with a very large ecosystem and very good documentation. That's maybe why I decided to go with PyTorch.

It was actually the fastest one for me to learn compared to other frameworks. It could have been done in other frameworks, right? At the end of the day, you know, I just need something that gives me automatic differentiation and I can probably do the same. But I think the thing that PyTorch did better than anyone else, at least at the time, was documentation and tutorials.

That was something that was really strong for PyTorch at the time, and it allowed me to quickly get into it and integrate all these ideas together.

[00:11:46] Abby Cabunoc Mayes: Yeah, no, that makes sense. PyTorch was also my first foray back into machine learning when I was like, I want to play with this a little bit, and yeah, it’s a good tool.
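As a toy illustration of the "all I really need is automatic differentiation" point, here is a minimal sketch (not QMCTorch code) of letting PyTorch's autograd drive a variational optimization: a one-parameter Gaussian trial wave function for the 1D harmonic oscillator, with the local energy's second derivative and the parameter gradient both supplied by autograd.

```python
# Toy variational sketch (NOT QMCTorch's API): optimize a Gaussian trial wave
# function psi_a(x) = exp(-a * x^2) for the 1D harmonic oscillator (hbar=m=w=1).
# Autograd supplies the second derivative in the local energy and the gradient
# with respect to the variational parameter; the exact answer is a = 0.5, E = 0.5.
import torch

log_a = torch.tensor(0.0, requires_grad=True)   # a = exp(log_a), start at a = 1
opt = torch.optim.Adam([log_a], lr=0.05)

def log_psi(x, a):
    return -a * x**2

for step in range(300):
    a = log_a.exp()
    # |psi_a|^2 is a Gaussian with std 1/(2*sqrt(a)); reparameterized sampling
    # keeps the samples differentiable with respect to a.
    eps = torch.randn(4096)
    x = eps / (2.0 * a.sqrt())

    lp = log_psi(x, a)
    dlp, = torch.autograd.grad(lp.sum(), x, create_graph=True)
    d2lp, = torch.autograd.grad(dlp.sum(), x, create_graph=True)
    # Local energy: E_L = -1/2 (psi''/psi) + V(x), with psi''/psi = d2lp + dlp^2.
    e_loc = -0.5 * (d2lp + dlp**2) + 0.5 * x**2

    energy = e_loc.mean()
    opt.zero_grad()
    energy.backward()
    opt.step()

print(float(log_a.exp()), float(energy))   # should approach a ~ 0.5, E ~ 0.5
```

The known minimum makes this a convenient smoke test for the autodiff plumbing; a real code like QMCTorch deals with molecules, Metropolis sampling, and far richer wave functions.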

[00:11:54] Arfon Smith: So I was going to ask: I've had a little bit of familiarity, but imagine a 20-year-old state of knowledge of some of the packages that people might use in computational chemistry. And I'm pretty sure I'm right in saying a lot of the big old-school ones, like, I'm thinking of Gaussian and Q-Chem and all these things, most of them are closed source.

I think most of them are not open source packages. I was wondering if you could say a bit more about the state of open source software in computational chemistry today. Has that changed much since my knowledge of how the world was back in the mid 2000s? But also, why did you decide to release this as an open source package?

[00:12:29] Nicolas Renaud: So it's changing, I think. I have the feeling, at least, that it's changing. But you're right in saying that most of these very well-known packages are closed source. And most of them come with a license, right? Which is also, you know, a different business model that works well.

And the open source movement is not that strong in this community. I'm not fully sure why, I have to admit, because if you compare with other communities, everything is open by default, right? So yeah, computational chemistry at large, and quantum chemistry in particular, takes a bit of a different approach there.

So it’s still very much a little bit like this. You have more and more packages that comes open source and adhere to the open source model, right? But the big names are still very much what they were 20 years ago. I can confirm that .

And the code we started with, so Claudia's code, right, CHAMP it's called, it was the same thing: it was closed source. But slowly, by also advocating for the open source movement, we managed to put it in the open source domain, right? But it's a change in how you approach the code, right?

It’s really a change of making the code available for everyone that wants to use it. And with this, of course, you need to also make sure that your code looks well, right? And that it’s, you know, people look at it and say, okay, I understand how that works. So you also need documentation. So we also made a lot of effort there, right? To create documentation for our code. So that other people can also use it. And I think it’s working well. I mean, it’s always been a very good code, but now it’s also open.

And so why is QMCTorch open? Because that's what we do at the eScience Center. That's one of our founding principles, even, I would say: when we start code from scratch, it's open. That's what it is. And this is because we are funded by the government, so it's only public money funding us, right? And therefore we have the feeling that all of it should be open. So when we start something from scratch, yeah, there is absolutely no question for us. This is how we work, and we do this on day one.

There’s also another thing that’s, you know, some of our research partners, okay, I’m going to do all the work and at the very end of the project, I will open source the code, right? And okay, I understand. But we try to do that, you know, on day one, we start the project like this.

So that also forces you to write your code in a better way because it’s public. Everybody can look at it and you don’t want them to have a bad impression of your work.

[00:14:24] Arfon Smith: I like to say to people that the best time to open source your code is at the moment you make the repo. It’s just easier then.

[00:14:29] Nicolas Renaud: You don't have to think about it anymore.

[00:14:31] Arfon Smith: Nobody’s looking, nobody’s gonna laugh at you if you write bad code. I think that sort of fear of being watched is very real for some people.

[00:14:38] Nicolas Renaud: definitely.

[00:14:39] Abby Cabunoc Mayes: Yeah, no, big fan of working in the open. It is scary for sure

[00:14:42] Nicolas Renaud: Yeah. Yeah.

[00:14:42] Abby Cabunoc Mayes: So did you start doing more open source when you came to the eScience Center? Or were you already doing open source before?

[00:14:49] Nicolas Renaud: No, it's very much when I joined the eScience Center that I started doing this. I also admit that before I joined I was a quantum chemist doing closed-source development. And it's really coming here that I was directly exposed to it. And you know, we have a lot of training programs, both internally and externally, that also help people learn how to do that best, right?

Once you have this sort of familiarity with it, it’s also much easier to develop your code like that. But it was something I had to learn when I joined, that’s for sure. Yeah,

[00:15:16] Abby Cabunoc Mayes: Yeah, that's awesome. I think at my first job, in the lab I was in, everything was published open source. Everything was released openly, and I'm so glad I did that. Otherwise, I wouldn't have seen how powerful it can be.

[00:15:26] Nicolas Renaud: Yeah.

[00:15:27] Abby Cabunoc Mayes: I’m glad there are institutes like that.

[00:15:29] Nicolas Renaud: Indeed, yeah. And I have to say that it's also a good vision from the Dutch academic world, which is really telling us to do this. So there is a strong vision for that to happen. And yeah, it's quite unique that there is, at least in Europe, such a big center, because we're about a hundred people working here now, that is really developing open source software for the research community. But it really comes from the top. It's really the Dutch government saying we should be doing this. So that's pretty unique, I think, in Europe.

[00:15:52] Abby Cabunoc Mayes: And I guess while we're on the topic of the eScience Center, I know in our emails when we were scheduling this, you talked about the Research Software Directory and some other work that you're doing just to help promote research software as a research output. Do you want to talk a bit about the other work you're doing there?

[00:16:07] Nicolas Renaud: Yeah, of course. So the Research Software Directory is something we've put together over the years, and I think we had the first release a couple of years ago, something like that, and it's really for people to be able to showcase their work as research software. So a place where they can highlight their research software and make sure that other people can find it, try to reuse it, and understand what the code does.

So we started with only our organization and only our software there. Now we have about 50-ish organizations putting their software there, and also their projects. So we can link research projects and research software to really put the software in its own context, right?

We want people who come from the outside to understand what the software does and why maybe they would like to use it. And overall, we're doing that for two things: one, to push for the development of open source software itself, right, to really invite people to do this; but also for research software to gain recognition, in a way, so that the broader academic landscape understands that it is a very real research output.

Publishing software is also something that is important for researchers; it's not only papers and citations. The tools you're publishing and sharing with the community are equally important.

And slowly we see change in this, right? In the recognition that research software has in the landscape. There's still a lot of work to do, but I think we're going in the right direction with this. And the Research Software Directory is one of the tools we're using to promote all of that. We also, together with people from GitHub, pushed for the Citation File Format quite a bit at some point, right?

So we’re also helping that in the back. And that as well, it’s another way of promoting research software as a main output and not as a byproduct of research.

[00:17:40] Arfon Smith: Yeah, you can put a CITATION.cff file in your repo, or actually any citation file, and GitHub will detect those. That was a really nice collaboration with that group. This feels like a good segue to just ask briefly: what was the primary motivation for publishing in JOSS?

Was there something you were particularly looking for as part of that review? Or was it more about sharing the software with the world? Could you say a little bit about your thought process?

[00:18:05] Nicolas Renaud: It's something that we always encourage in all our projects when we develop software: to consider a software publication, right? And JOSS is one of the prime journals that we recommend, together with SoftwareX and a few others, of course. What I really like personally about JOSS is that the review is of your code, right?

The paper is nice, the text, everybody's going to read it and maybe there's a nice picture, but the review is of your code, only of your code. And that is something that makes it very different from the other journals. And the interaction with the reviewers, especially for QMCTorch, was actually really nice.

They raised a very valid point, right? One that any other review focused more on the paper would've completely overlooked. So we really liked that a lot, 'cause it's really doing the review of the software on the software. So yeah, we are big fans of JOSS.

[00:18:47] Abby Cabunoc Mayes: That’s great. And it’s really great that QMCTorch is open. So one thing I really love about open source is just the ability to build collaboratively with others and getting outside contributions.

Is QMCTorch open to outside contributions? If people want to contribute, how can they?

[00:19:02] Nicolas Renaud: It is, very much so, and I will welcome any contribution, that's for sure. If people want to contribute, I set up a couple of issue templates, right, to either request features or report bugs. I had a couple of master's students who actually worked on it as well, doing development, and we developed it like that.

And, you know, it’s still a very niche software, I have to be honest, right? It’s for support of the quantum chemistry community. So a good understanding of the domain always helps, right? That’s for sure. And willingness to use it, I think that’s what I’m doing now. I really decided to publish the software first, right, and then to use the software to publish results.

So that’s what I’m doing now. So if people want to collaborate and publish results with me, I will be very glad to do that together. But that’s something I want to achieve this year, right, to really show now to a more domain oriented paper what the code can do, right, and how to use it, how to further develop it.

And one thing that is interesting, really, on the domain side of it: you know, going back to the quantum Monte Carlo problem, the thing that QMCTorch does that maybe other software isn't really doing is allowing people to create, quite easily, the way they want to describe their molecule.

So it’s very complicated mathematical object, right? This wave function that we have, but in the way QMCTorch is built, we can really have like different modules that we can assemble. And people can define their little neural network that’s going to compute a little part of the bigger wave function and experiment with it and see if it leads to better result or worse result.

And that’s something I would like to see, right? To have a lot of library of this sort of sub module, right? To see how they perform. And once again, what’s through PyTorch, right? We’re doing really most of the heavy lifting. They really have to define a very simple function, very, you know, vanilla neural network.

And they can put that into QMCTorch and see what type of results they get. So that's really the plan that I have for this year: to experiment a little bit more with it and publish those results.
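To illustrate the kind of modularity Nico describes, here is a hypothetical sketch of a user-supplied wave-function component as a plain PyTorch module; the class names are invented for this example and are not QMCTorch's actual API.

```python
# Hypothetical sketch of a plug-in wave-function component; the names here are
# invented for illustration and do not correspond to QMCTorch's real classes.
import torch
import torch.nn as nn


class JastrowLikeFactor(nn.Module):
    """Tiny MLP producing a positive correlation factor from electron coordinates."""

    def __init__(self, n_coords: int, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_coords, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, pos: torch.Tensor) -> torch.Tensor:
        # exp(...) keeps the factor strictly positive.
        return torch.exp(self.net(pos)).squeeze(-1)


class ToyWaveFunction(nn.Module):
    """Baseline Gaussian orbital product multiplied by a learnable factor."""

    def __init__(self, n_coords: int):
        super().__init__()
        self.correlation = JastrowLikeFactor(n_coords)

    def forward(self, pos: torch.Tensor) -> torch.Tensor:
        baseline = torch.exp(-0.5 * (pos**2).sum(dim=-1))
        return baseline * self.correlation(pos)


psi = ToyWaveFunction(n_coords=6)        # e.g. 2 electrons x 3 coordinates
values = psi(torch.randn(128, 6))        # wave-function values for a batch of samples
print(values.shape)                      # torch.Size([128])
```

The point is only that the piece a user writes can be a very vanilla neural network; the surrounding machinery (sampling, local energy, optimization) stays in the framework.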

[00:20:51] Arfon Smith: That sounds like lots of work to do, lots of opportunities to exploit the software for great new science. Thanks for giving us that overview. I guess this is maybe a good place to wrap up. I just wanted to ask, you know, how do people keep track of your work and what you're up to?

Is there a place online people can follow what you do? What's the best way to keep up with the work that Nico is doing and, you know, the development of QMCTorch?

[00:21:16] Nicolas Renaud: I'm not a big social media person; I think the best place to see what I'm doing is really GitHub. That's where I'm the most active. And it's not because you're here that I'm saying this, I think it's true. We do all the development on all the projects there.

Right. It’s also something we do internally to see what people are doing at the center. Right. So we try to keep an overview of all of this. And that’s really the main place where we publish everything we do in the end. Or the the Research Software Directory is another good place I should mention.

[00:21:38] Arfon Smith: Fantastic. Well, I think that’s a wrap. Thanks again for your time today, Nico. It’s been really interesting to talk to you.

[00:21:43] Nicolas Renaud: Oh, thank you for inviting me. Really happy to be here. It was really nice.

[00:21:47] Abby Cabunoc Mayes: Thank you so much for listening to Open Source for Researchers. I loved that conversation with Nico. We showcase open-source software built by and for researchers. So please subscribe in your favorite podcast app. Open Source for Researchers is produced and hosted by Arfon Smith and me, Abby Cabunoc Mayes. Edited by Abby and the music is CC-BY Boxcat Games.

[/expand]


JOSSCast #3: Studying Superbugs – Juliette Hayer on Baargin

Subscribe Now: Apple, Spotify, YouTube, RSS

Juliette Hayer joins Arfon and Abby to discuss Baargin, an open source tool she created to analyze bacterial genomes, especially those resistant to antibiotics.

Juliette is a PhD Researcher at the French Research Institute for Sustainable Development (IRD, Institut de Recherche pour le Développement), at the MIVEGEC research unit, where she implements computational biology methods for bacterial genomics and metagenomics to understand the circulation and transmission of antimicrobial resistance.

You can find Juliette on GitHub (@jhayer), ResearchGate, and X (@juliette_hayer).

Episode Highlights

  • [00:02:21] Introduction to Baargin: Juliette explains that Baargin stands for Bacterial Assembly and Antimicrobial Resistance Genes Detection in Nextflow. She developed it to analyze the genomes of drug-resistant bacteria in various environments.
  • [00:06:20] Multiplex Sequencing: Juliette discusses the challenge of assembling genomes for multiple strains simultaneously using high-throughput sequencing technologies.
  • [00:07:21] Next-Gen Sequencing and Assembly: The conversation delves into next-generation sequencing, the assembly of short reads, and the emergence of long-read technologies for comprehensive genome analysis.
  • [00:09:59] Target Audience: Juliette identifies microbiologists as the primary audience for Baargin, emphasizing its user-friendliness for researchers producing genome data.
  • [00:12:50] Nextflow in Bioinformatics: Juliette explains the role of Nextflow in bioinformatics and its popularity, highlighting its benefits for scalable and reproducible workflows.
  • [00:17:03] Open Source Philosophy: Juliette shares her commitment to open source principles, advocating for transparency, reproducibility, and collaborative contributions in research.
  • [00:19:20] Research Using Baargin: Juliette discusses her published studies, including the identification of drug-resistant E. coli transmission in Chile and ongoing projects in Vietnam and Cambodia.
  • [00:20:14] Publishing in JOSS: Juliette describes the benefits of publishing in the Journal of Open Source Software (JOSS), emphasizing the focus on code and transparent review processes.
  • [00:23:27] Documentation Importance: The hosts discuss the significance of documentation in software development, with Juliette highlighting its critical role in ensuring usability.
  • [00:26:03] Contributions and Skills: Juliette welcomes contributions to Baargin, mentioning that comfort with git and Nextflow is essential for potential contributors.
  • [00:28:27] Future Roadmap: Juliette outlines plans for extending Baargin, including adding tools for predicting resistance genes, improving detection of mobile genetic elements, and enhancing multi-locus sequence typing.

Transcript

[00:00:05] Arfon Smith: Welcome to Open Source for Researchers, a podcast showcasing open source software built by researchers for researchers. My name’s Arfon.

[00:00:12] Abby Cabunoc Mayes: And I’m Abby.

[00:00:13] Arfon Smith: And we’re your hosts. The way this works is that every other week we interview an author published in the Journal of Open Source Software and talk about their work. Today we talked with Juliette about Baargin, this software that’s responsible for assembling genomes for these bacteria that have resistance to drugs, antibiotics.

When I choose to worry about the world, this is the thing I worry about. The fact that we have these bacteria that are increasingly becoming resistant to antibiotics. It seems like a really important piece of software for potentially the future of humanity.

[expand]

[00:00:46] Abby Cabunoc Mayes: Yeah, no, I definitely agree. Whenever I'm taking antibiotics, I get worried. It's like, do I really need this? I don't want to help more superbugs come around.

Yeah.

[00:00:54] Arfon Smith: For sure. And also, I had a year back in 2007 to 2008 working in a bioinformatics institute. I think you've spent time in bioinformatics too.

[00:01:04] Abby Cabunoc Mayes: Yeah, yeah, I actually have a degree in bioinformatics and then I joined a bioinformatics lab. I worked there for about five years. So I probably should have known a bit more about bioinformatics when we were talking with Juliette. I also realized I'm still calling it next-gen sequencing.

It’s been a decade. Is this still the next gen? Who knows?

[00:01:21] Arfon Smith: Yeah, yes, but we'll forgive you. I think it's fine. Yeah, it was called NGS, right? That's how I remember it as well.

[00:01:29] Abby Cabunoc Mayes: Now it's high-throughput sequencing, is what she said. Probably the more modern way to talk about it.

[00:01:36] Arfon Smith: So the sort of TL;DR for today's episode is: existential risk to humanity (pay attention), and listen to two out-of-date people talking about bioinformatics and updating their working vocabulary on the topic.

[00:01:48] Abby Cabunoc Mayes: It was great hearing about her experiences, both with JOSS and with creating these workflows for other bioinformaticians to use.

[00:01:54] Arfon Smith: For sure. Shall we jump into the conversation?

[00:01:57] Abby Cabunoc Mayes: Let’s do it.

[00:01:58] Arfon Smith: This is episode 3 and we're talking with Juliette Hayer about their paper, Baargin, a Nextflow Workflow for the Automatic Analysis of Bacterial Genomics Data with a Focus on Antimicrobial Resistance.

That is a long sentence but it is a great paper so we’re going to talk about it. Juliette is a PhD researcher at the French Research Institute for Sustainable Development, and welcome to the podcast, Juliette.

[00:02:21] Juliette Hayer: Thank you very much for inviting me.

[00:02:24] Abby Cabunoc Mayes: Of course. Just to dive right in, so I know before we started recording, you told us how to pronounce Baargin. Can you tell us what that stands for? I know it’s an acronym, and maybe a bit about why you started it?

[00:02:36] Juliette Hayer: Yeah. So the acronym to start with is

[00:02:39] Abby Cabunoc Mayes: Editing Abby here. We lose Juliette's audio for just a split second while she's explaining the acronym. Baargin, B-A-A-R-G-I-N, stands for Bacterial Assembly and Antimicrobial Resistance Genes detection In Nextflow. I'll let Juliette continue to explain why she started the project.

[00:02:56] Juliette Hayer: But the way I started is that, in the framework of my research program at IRD, at the MIVEGEC research unit where I work, part of my work is to implement bioinformatics tools to investigate the circulation of antimicrobial-resistant bacteria and antimicrobial resistance genes between humans, animals, and the environment.

I collaborate a lot with other researchers from Southeast Asia, Africa, and also South America. And so in this context, I developed Baargin.

You may know some bacteria that you can actually find everywhere, like the Enterobacteria. Some are very famous, like Escherichia coli, also called E. coli. They can be found in any kind of environment and some of them can become pathogens for humans or for animals. And some of them can also become highly resistant to drugs, to antibiotics. So that's what we study. And it is actually the overuse and the misuse of antibiotics that's leading to this high amount of resistance among the bacteria.

And, it’s maybe worth, noting that WHO has said it’s a major problem for public health this, antimicrobial resistance. So in some of the research that we do, with the collaborator in Southeast Asia, Africa, and South America. We use genomics and metagenomics approaches to investigate the circulation of these bacteria and their resistance genes.

So the genome is all the genetic material of the bug, of the bacteria, and that's what we study and try to sequence with high-throughput sequencing technologies. And within these kinds of projects we produce quite a large amount of data that needs to be analyzed, and we need to compare these different bacterial strains, and that's why I started Baargin. I wanted to develop a bioinformatics workflow that would be easy to use, flexible, and highly scalable, to be able to analyze hundreds of different strains at the same time. So that's how I started.

[00:05:14] Abby Cabunoc Mayes: So when you’re actually doing the sequencing, you’re putting hundreds of strains all at once that get sequenced all together. Is that correct?

[00:05:21] Juliette Hayer: exactly.

[00:05:21] Abby Cabunoc Mayes: Yeah, that’s very different than what I’ve seen before. I haven’t done bioinformatics in a long time, but usually it’s one specimen, yeah.

[00:05:31] Juliette Hayer: Exactly. So now you can multiplex, as we say: put several strains at the same time on the same flow cell for sequencing, and you get a very high throughput of data. Then you of course have to split what belongs to which strain, but you can sequence many at the same time.

And currently, in one project in Cambodia that I'm working on, we have about 700 genomes of bacterial strains, mainly Enterobacteria. So that's also why we needed something like this.

[00:06:04] Abby Cabunoc Mayes: Yeah, that’s really interesting because even just with regular high throughput sequencing, it’s a challenge to put the genome back together again. The genome assembly is still tricky, but here it’s just another layer where you’re putting together multiple different genomes.

[00:06:19] Juliette Hayer: So you do it in parallel, yes. The genome assembly is the step that takes the most computing resources, of course, but we now have very nice tools that have been developed for doing that, and they are quite efficient and do not use as much as they did before.

The one I have included in Baargin is called SPAdes. It's one of the most famous for assembling bacterial genomes. It's very powerful, and the thing with Nextflow, the workflow manager that I've used to develop Baargin, is that it can parallelize the job for, like, hundreds of strains at the same time.

So that’s what’s cool. Depends also on your hardware as well.

[00:06:59] Abby Cabunoc Mayes: Yeah, and just for anyone listening who's unfamiliar with next-gen sequencing, and you can correct me if I'm wrong, Juliette: instead of just reading a genome one letter at a time, it splits it all up into tiny pieces, sequences them all, and then tries to stitch them all back together.

So it’s a fun computational challenge, I think. but yeah.

[00:07:20] Juliette Hayer: Correct.

[00:07:21] Arfon Smith: So both Abby and I, I think, have in past lives worked at bioinformatics institutes. I actually worked for a year at the Wellcome Trust Sanger Institute in Cambridge, which was one of the places that sequenced the original human genome. So I know a little bit about what were called next-generation sequencers back in 2008 or something, I guess.

I was curious, what’s the hardware that you’re using here? Is it those very short read sequences that are being assembled together? Or is it, I know there’s some like nanopore stuff that was sort of magical future looking stuff. Like, what’s the actual tech that’s running under the hood here?

[00:07:59] Juliette Hayer: Yeah. So we actually use both. We basically use the short-read technology, where the market is still led by Illumina. The length of the sequences that are produced is usually 150 base pairs, and you get two sequences for the same fragment, and then you get a lot of them. So that's very high throughput, which is very, very nice.

But then you need the assembler to reconstruct the longer pieces, which we call contigs, the longer sequences. And sometimes it can be difficult. Most of the time, you cannot get the full chromosome of the bacteria in one go with this. So now they have developed the long-read technologies, like PacBio and Oxford Nanopore technologies.

And we also use that, because it is amazing for getting the full structure of the chromosome, or of the plasmids as well, which are the small circular pieces that also occur in bacteria and usually carry a lot of resistance genes. So it's also nice to get their structure fully. And the best of the best is to combine both.

So we get very high quality because of the high throughput of the short reads, which we can map, so align, onto the long reads, and then you get a perfect resolution of the genomes. And Baargin can take both: either only short reads, or short and long reads as input, or already assembled contigs, if you wanted to assemble them yourself beforehand and just run the rest.

[00:09:38] Arfon Smith: That sounds really powerful. It sounds like you've thought about all these different technologies and made it very flexible for different scenarios. Who's your sort of prototypical user? Is it a researcher, or would you expect a sort of analyst or an engineer to run this code?

Who do you find uses your software most?

[00:09:56] Juliette Hayer: So my very first audience is my colleagues in my research group and also my collaborators abroad. Some of them really needed something like that. Of course, because it's in Nextflow, you have to run it from the terminal, so you have to have some basics in Unix just to get it to run. But yeah, the main audience is microbiologists, because I think many people nowadays, researchers or engineers or lab technicians, will produce genome data for their strains, because it's now something that you do basically every day: okay, you get a new strain, you will sequence it.

It can be a researcher or other people, but definitely someone working in microbiology. But yeah, I would advise having some skills in Unix. I didn't make a user interface yet, an easy click-button thing, but maybe that should be in the plan.

[00:10:54] Arfon Smith: That’s okay, I think there’s a whole collection of tools out there that don’t have like a nice GUI. The interface is the terminal.

I was actually curious: say I buy myself a sequencer and I wanted to use this tool, are people typically running it on big clusters? Is there lots of compute under the hood? Like, how big are the jobs? It probably depends on how much data you have, but typically what sort of hardware would people be running this tool on?

[00:11:19] Juliette Hayer: So I think basically, if you just have a Nanopore plugged into your laptop and you sequence maybe, let's say, four or five strains of Enterobacteria that have a genome of maybe five megabases, something like that, you could run it on a laptop. I think that would be fine. If the assembly can run, then you're fine.

Of course, you can customize the databases that are used for detecting the antimicrobial resistance genes and the plasmids, and also for annotating the genome afterwards. If you have a lot of space, if you have an HPC in your lab, that's better if you want to run a lot of strains at the same time and also install the larger databases that provide more predictive power.

I made it so it could be installed on just a laptop with minimal databases, just to get some results first.

[00:12:19] Abby Cabunoc Mayes: Yeah, so if I get my hands on a Nanopore, and I really want to know what bacteria is growing in my bathroom, I could maybe, if it's not too big, potentially sequence it and run it through Baargin.

[00:12:28] Arfon Smith: Let us know if you take that project on as a side hustle, Abby.

[00:12:32] Abby Cabunoc Mayes: I will, yeah.

[00:12:33] Arfon Smith: the results. [Laughter]

[00:12:34] Abby Cabunoc Mayes: So you talked a little bit about Nextflow. Can you tell us a bit about its role in bioinformatics? I think Nextflow is more recent than what either Arfon or I have used. So it would be great to hear what it does.

[00:12:46] Juliette Hayer: Nextflow is a great workflow manager. Of course, there are others that exist, so you have different schools with different kinds of people. I know there is also Snakemake. I haven't used it that much, even though I really like coding in Python, and Snakemake is based on Python.

But Nextflow, I think they started around 2015 maybe. I hope I’m not wrong, I haven’t checked that before. But I started using it in 2017 or ’18, when I was in Sweden at the time, and I really enjoyed a workshop that I went to in Barcelona, where I started learning about Nextflow and how to use it and how to run pipelines, and then I met the first people of the nf-core community.

I don’t know if you have heard about this, but that’s how Nextflow became very big in bioinformatics: a few people that started using Nextflow for building bioinformatics pipelines met together and started a real community, and they put out some guidelines on how you should develop a Nextflow pipeline for bioinformatics.

So I think they are huge nowadays, because they have put out many pipelines, many of which are really highly used by people in bioinformatics.

[00:14:10] Arfon Smith: So Nextflow sounds like it’s super popular. Yeah, I was also familiar with Snakemake as well. I’ve not used either. I know they’re both popular and there are other tools. Again, I’m aware of Galaxy workflows. I think that’s another tool.

[00:14:23] Juliette Hayer: That’s different. Yeah,

[00:14:24] Arfon Smith: that’s another one again.

There are a bunch of reasons to use a workflow management tool, but one of them is just reproducibility and the ability to reliably execute a set of tools. It makes tons of sense. I was curious though: that workflow is executing Baargin at some point, it’s gonna be a step in the process.

What other tools were available for these kinds of tasks before this one? Did you find yourself unhappy with what was already available? What else might people use if they’re not using the software you’ve created?

[00:14:54] Juliette Hayer: So, maybe just one point about Nextflow again that I’m not sure I mentioned: it’s based on the language Groovy, which is itself based on Java, so that’s worth noting. For other tools that can do several of the things Baargin does, yes, of course they exist, because many people work on these kinds of questions.

One that comes to my mind and seems to be really great is Bactopia. I think this is also made in Nextflow, but it’s huge, it’s very complex, and it’s, in my opinion, not that easy to install for any user. So for bioinformaticians and people that are skilled, that already know what they are looking for in the bacterial genome and which sub-workflows they want to use, I would definitely go for Bactopia.

For me, I really wanted something simpler than this, something with a real focus on detecting the plasmid features and the antimicrobial resistance, combining different tools that can predict resistance genes.

And there are also other great tools that exist, but they can be specific to only one bacterial species. I know about Kleborate, which is also a very nice pipeline, but only for Klebsiella species, which is a kind of bacteria.

So that’s how I ended up coding Baargin. Also to distribute it to our collaborators that don’t necessarily have the computing power that is required for installing very large workflows with a lot of databases and all.

So I wanted something lightweight as well.

[00:16:36] Abby Cabunoc Mayes: It was interesting hearing you talk about these workflow systems, especially Nextflow and things like Galaxy, and how they’ve built a community around that through best practices and these conferences and stuff.

I know you’ve made it primarily for your collaborators, but do you think there’s room for this to be more widely used by others? Why did you make it open source to begin with? I think that was the real question I was asking, but go ahead.

[00:17:00] Juliette Hayer: Yeah, so why did I make it open source? Because I’m a researcher and I’m working with collaborators from other academic institutions all over the world, and I think everything we do should be open source. That’s really something I believe, and in France, in the academic institutions, we advocate a lot for open source, and even more so at my institution, IRD. So that is really something that I could not imagine differently.

Also for reproducibility, and so that all the people can contribute, that’s also important. And so they know what they’re doing when they are running it: they can go into the code, and it’s not a black box.

So I think for me, that’s very important.

[00:17:46] Abby Cabunoc Mayes: So Juliette, have you run many studies using Baargin yet?

Any interesting insights you want to share?

[00:17:51] Juliette Hayer: So, my collaborators, my students, and I are using Baargin a lot. I have one study published where we collaborated with Chilean colleagues, where we identified the transmission of super resistant E. coli between wild animals, livestock, and companion animals within and between farms in central Chile.

So that was one study. And then we have other studies ongoing in Vietnam, in different hospitals, where we isolated Klebsiella strains that are also resistant to carbapenem and colistin antibiotics. So that’s also interesting. One of my students is working on that. And we have an ongoing project with Institut Pasteur in Cambodia and Battambang Hospital in Cambodia, where we have a lot of bacterial genomes.

That’s the one I talked about before, about 700 different bacteria that we collected, starting with patients who came to the hospital with a resistant infection, and then we went to their households to also collect bacterial strains from their environment, food, and the animals that they have.

We have not published that yet, but we are analyzing the results at the moment. So there will be interesting results coming soon, hopefully.

[00:19:16] Abby Cabunoc Mayes: Yeah, it sounds like actual useful information that will help people in the world. So, that’s great

[00:19:22] Juliette Hayer: yes

[00:19:23] Abby Cabunoc Mayes: cool, so we’ll link the published studies in the show notes.

[00:19:25] Juliette Hayer: Yeah, it’s really to try to understand how the resistance circulates between the humans and the animals and their environment: what do they carry, which resistance genes do they carry, and how do they share them. But yes, I will send you at least the publication that is already published and will keep you updated on the next steps.

[00:19:49] Abby Cabunoc Mayes: Yeah, that’s great. Yeah, we’ll include that in the show notes for sure.

[00:19:51] Arfon Smith: If we could just switch tracks for a second and just talk about JOSS and the fact that you published there. Looking at the paper, I think it was back in October 2023 when the paper was finally published. I was curious if you could just say a bit more about, why you published in JOSS and tell us a little bit about that experience.

[00:20:09] Juliette Hayer: So the first idea came from one of my co-authors, Jacques Dainat, who had already published in JOSS, and he told me about how amazing the experience was, and how we should always do that when we publish bioinformatics workflows or tools, because everything is centered on the actual tool or pipeline.

And that is central. And also the fact that the code itself is reviewed, by other people that know about what you’re doing and are probably in the same field, because you can choose the reviewers. You have a list of potential reviewers in JOSS that you can select from, and you have an idea of what they are working with.

And of course, you can visit their GitHub, so that also helps. So that was one great thing, and also everything is open during the review process and transparent. Anyone can see what’s happening with the code, and you can get very nice input from the reviewers to also improve your code and the documentation, which is very important.

If I had not published in JOSS, maybe I wouldn’t have put that much effort into writing well structured documentation with examples and tests and all these things that are actually very good for a bioinformatics workflow, because I have seen a lot of other workflows that get published in more standard life science journals.

And then, what? Usually they are almost impossible to install, or there is no documentation, because they didn’t need to do that to publish it. So here, yeah, I think it’s good that the code and the documentation are at the center of the whole thing. And also, discussing with the reviewers over GitHub, a tool that we use on a daily basis, that also makes things very convenient.

[00:22:08] Arfon Smith: Sounds like you had a good review experience. I had a quick look over the review again before this conversation, and it looked like it was a super productive conversation with lots of good feedback. I think you’re right. One of the key areas where I think authors get the most value is the review of documentation. It is so valuable to have somebody who knows nothing about your tool just start with a fresh blank slate, new machine, new environment.

There’s probably another name for us, which is like Journal of Open Source Documentation or something. It’s the software, of course, but usability often begins with great docs and clearly defined dependencies. It’s just so hard to be objective as the author of software when thinking about what somebody would need to know and the sort of undocumented steps. If we had to look where most changes get made as a result of a JOSS review, my guess would be docs every time.

I think it’s the most common area of change, and it just reinforces that software is not just about the code executing on the machine; it’s all the bits around the side that are so important for the humans who are going to be operating that software as well.

[00:23:21] Juliette Hayer: Yeah,

[00:23:22] Abby Cabunoc Mayes: One thing I do like about that process is just how it pushes you to make it more usable for a broader group. I know we talked about, like, are you building that open source ecosystem? This is one step towards that, having that good documentation so others can

[00:23:34] Juliette Hayer: That’s true,

[00:23:35] Abby Cabunoc Mayes: jump in and use it.

Is there anything else you learned going through this review process or you’re grateful for?

[00:23:40] Juliette Hayer: Yes, they really insisted on running tests. It’s not easy to set up unit testing and continuous integration when you work with workflows, so I’m very open to suggestions for that, and to contributions if people want to help with this, because it’s not an easy or trivial task. They gave some hints and some input regarding things that I should try.

I haven’t completely done it, but I have made a full test where you can test all the mandatory steps of the pipeline, but it doesn’t test process by process. So that was useful to really make me think about the things that I should improve from that point of view as well.
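
As an aside for readers wondering what the end-to-end check Juliette describes might look like in practice, here is a minimal sketch, assuming a hypothetical test setup rather than Baargin’s actual test harness: it launches a Nextflow run on a small test profile from Python and checks that expected outputs exist. The profile name, pipeline location, and output file names are assumptions for illustration only.

```python
import subprocess
from pathlib import Path

# Hypothetical locations; a real test would point at the pipeline's own
# test profile and its documented outputs.
PIPELINE_DIR = Path("baargin")     # assumed local clone of the pipeline
OUTDIR = Path("test_results")      # assumed output directory

def test_full_pipeline_runs():
    """Run the whole workflow on a tiny test dataset and check key outputs."""
    cmd = [
        "nextflow", "run", str(PIPELINE_DIR),
        "-profile", "test,docker",   # assumed name of a small test profile
        "--outdir", str(OUTDIR),     # assumed pipeline parameter
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    assert result.returncode == 0, result.stderr

    # Check that the mandatory steps produced their summary files
    # (these file names are placeholders, not Baargin's real outputs).
    for expected in ["assembly_summary.tsv", "amr_report.tsv"]:
        assert (OUTDIR / expected).exists(), f"missing {expected}"

if __name__ == "__main__":
    test_full_pipeline_runs()
    print("End-to-end test passed.")
```

A process-by-process test, by contrast, would exercise each Nextflow module in isolation, which is the harder part Juliette mentions.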

[00:24:20] Arfon Smith: Testing something like a workflow tool is probably pretty hard, right? This is a challenge we have with JOSS where, as part of the review process, as you probably remember, the actual language is a little unusual. We say reviewers must be able to objectively test the functionality, or verify the functionality, of the software.

So we don’t say, you must have a million unit tests and 99 percent test coverage. What we’re trying to get to is: You need to be able to verify that this thing works. And that seems like a reasonable thing. But, there are a number of times when that actually can be quite hard. You know, when it’s a complex system, maybe it’s running on a cluster.

It’s really hard for a user to verify that. Custom hardware often makes it hard for people, where, you know, you need a particular variant of a GPU or something. Another one is actually just complex user interfaces. If it’s just a command line tool with standard inputs and outputs, it can be quite easy to write normal tests.

If it’s a really complex set of interactions the user has to do, the testing can be a huge amount of the work, actually. So, yeah, I hope we find the balance there.

[00:25:33] Juliette Hayer: But I think something very nice with JOSS is that at least it forces you to have a test that runs on other people’s machines, which not everyone can claim when they publish.

[00:25:47] Arfon Smith: It’s true. It’s true. Absolutely. Yep,

[00:25:49] Abby Cabunoc Mayes: So you mentioned that you’re open to testing contributions. So if people do want to contribute, what sort of skills do they need? What languages? What sort of background are you looking for?

[00:25:58] Juliette Hayer: I think they need to be quite comfortable or confident using Git and Nextflow. That’s the basics, because the workflow is coded with Nextflow’s DSL2 syntax. So I have all the steps as modules. All the processes are in separate modules that can be reused in other workflows as well. So it’s quite easy, if you can code in Nextflow, to add a new module and then clone the repo and change the main script accordingly to add the step you want in the full workflow. So people that also want to modify a part of it, or add a step that I did not necessarily have in Baargin already, they can do that: they can clone or fork it, or suggest that I add it, if it’s something that should be added, of course.

[00:26:51] Arfon Smith: It sounds like contributions are welcome, which is great. I was going to say, are there obvious things that you’re personally interested in extending the software with at this point? Or is it mostly sort of done for your needs and your collaborators needs?

[00:27:05] Juliette Hayer: For the needs we have now, it’s mostly done, but there are some tools I’m already thinking about changing, or adding other options. Because I really like combining different methods for doing the same thing. And so at the moment I have two different tools for predicting the resistance genes, as I said before.

I would be happy to add new ones that are coming. But then I would need to work on harmonizing the results, which is not necessarily easy, because they use different databases, and genes in the databases can have different names and all, so that’s a whole different story. But that would be fun, I think.
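
As a tiny aside on the harmonization problem Juliette mentions, here is a sketch, with an invented synonym table, of reconciling resistance-gene calls from two tools whose databases label the same gene differently. The gene names are real examples, but the mapping and tool outputs are made up for illustration and are not part of Baargin.

```python
# Invented synonym table mapping each database's label onto a canonical name.
SYNONYMS = {
    "blaCTX-M-15": "blaCTX-M-15",
    "CTX-M-15": "blaCTX-M-15",       # same gene, different database label
    "aac(6')-Ib-cr": "aac(6')-Ib-cr",
    "aac6-Ib-cr": "aac(6')-Ib-cr",
}

def harmonize(calls):
    """Map a tool's gene labels onto shared canonical names."""
    return {SYNONYMS.get(gene, gene) for gene in calls}

# Made-up outputs from two hypothetical resistance-gene predictors.
tool_a = ["blaCTX-M-15", "aac6-Ib-cr"]
tool_b = ["CTX-M-15", "qnrB1"]

agreed = harmonize(tool_a) & harmonize(tool_b)    # genes both tools report
combined = harmonize(tool_a) | harmonize(tool_b)  # everything either tool found
print("agreed:", agreed)
print("combined:", combined)
```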

For the multilocus sequence typing, right now I have a very basic tool that works on seven genes for typing the bacteria, and I want to add another tool that would base the typing on many more genes. So that’s the first thing I want to do.

And then there are others, of course. I would like to improve the detection of the mobile genetic elements. That’s something I really need to work on, investigating other tools that could be included.

[00:28:22] Arfon Smith: Sounds like you’ve got quite a roadmap ahead of you, actually.

[00:28:24] Juliette Hayer: Yeah.

[00:28:25] Abby Cabunoc Mayes: Yeah, and it also sounds a bit like the kind of contributions that would be most helpful are other users who have their own use case and want to add different parts to the workflow and change it up a little. Is that right? Yeah. Cool. Awesome. So if you’re listening and you want to use Baargin, Juliette here welcomes your contribution.

[00:28:43] Arfon Smith: Sounds like it.

[00:28:44] Abby Cabunoc Mayes: Just to close us off, how can people find you online and keep up to date with your work?

[00:28:48] Juliette Hayer: So they can find me on GitHub, of course, under @jhayer, where Baargin is released, and you can find me on ResearchGate and on X, where it’s Juliette underscore Hayer.

[00:29:01] Abby Cabunoc Mayes: Perfect.

[00:29:01] Arfon Smith: Awesome.

Yeah. Well, Juliette, thank you so much for coming and being part of the JOSSCast. It’s been great to talk to you today about the software, Baargin. I’m still not saying it right, I’m

[00:29:13] Juliette Hayer: It’s okay.

[00:29:14] Arfon Smith: it’s okay, I’m getting better. Okay, Baargin and the problems it solves sound incredibly relevant. I know I worry about antibacterial resistance. I think that’s a really important thing for humanity to be working on. I’m grateful for the work that you and your team are doing, and for publishing in JOSS.

[00:29:33] Juliette Hayer: Thank you very much for having me.

[00:29:41] Abby Cabunoc Mayes: Thank you so much for listening to Open Source for Researchers. This is our first episode that we’ve released since our launch. It’s been amazing to see the response. Thank you so much for subscribing, for telling your friends, for sharing on social media. We love to showcase open-source software built by and for researchers, so subscribe to hear more in your favorite podcast app.

Open Source for Researchers is produced and hosted by Arfon Smith and me, Abby Cabunoc Mayes, edited by Abby and music is CC-BY Boxcat Games.

[/expand]

Video

JOSSCast #2: Astronomy in the Open – Dr. Taylor James Bell on Eureka!

Subscribe Now: Apple, Spotify, YouTube, RSS

In this episode of Open Source for Researchers hosts Abby and Arfon explore the world of open source software in astronomy with Dr. Taylor James Bell, a BAER Institute postdoc at NASA Ames. Eureka! is an end-to-end pipeline designed for JWST (James Webb Space Telescope) time series observations. We chat about the motivations behind Eureka!, its unique features, and the democratization of exoplanet science.

Join us for an engaging conversation that unveils the complexity of time series observations, the power of open source, and the exciting future of Eureka! and JWST discoveries.

Transcript

[00:00:04] Abby: Welcome to Open Source for Researchers, a podcast showcasing open source software built by researchers for researchers. My name’s Abby

[00:00:11] Arfon: and I’m Arfon.

[00:00:13] Abby: and we’re your hosts. Every other week we interview an author published in the Journal of Open Source Software, JOSS, and this is episode two. We’re chatting with Dr. Taylor James Bell about his software Eureka!, and the paper’s called Eureka!: an end to end pipeline for JWST time series observations.

[00:00:30] Arfon: So this is an exciting episode for me, we get to dig into some of the open source software that’s been built for the James Webb Space Telescope, which is a mission I used to work on in a previous life. This is a groundbreaking scientific observatory that’s in space right now and doing great science and it was really interesting to chat with Taylor about the work he’s doing with the team.

[00:00:49] Abby: Yeah, and even for me, as someone who’s not an astronomer and didn’t quite understand everything about exoplanets, I still really enjoyed hearing about his experience trying to build the software in a way that is community focused, and not just publishing a paper and having it go out.

[00:01:05] Arfon: Absolutely. Yeah. Looking forward to jumping in.

[00:01:08] Abby: Let’s do it. Taylor, do you want to say a little bit about yourself?

[00:01:12] Taylor: Yeah, I’m a BAER Institute postdoc here at NASA Ames in Mountain View, California. I work on studying the atmospheres of exoplanets, so planets orbiting distant stars. And to do that, I do a lot of data analysis and recently open source software.

[expand]

[00:01:29] Abby: That’s amazing.

Just to kick us off, why did you start this project, this piece of software?

[00:01:34] Taylor: It was initially started by two of my colleagues, Kevin Stevenson and Laura Kreidberg, two faculty in astronomy, and they’re both well respected experts in the analysis of Spitzer and Hubble Space Telescope data.

But we wanted to prepare for the JWST launch, which was coming up in about a year from then. So the two of them had written software for Hubble and Spitzer and wanted to get ready. And they wanted to do something different this time. Their code for Hubble and Spitzer, sometimes it was open source, sometimes it wasn’t, but it was never community built, community driven, and really truly open source. Not just open source, but well documented, with user guides and documentation and all of these things. So they reached out to a broad group of experts in the field to get involved and make it a large team project.

And so about a dozen of us joined the team, and just naturally through those collaborations, I rose to a kind of primary contributor level in the team.

[00:02:39] Abby: That’s awesome. And I should have mentioned before, we are talking about Eureka!.

[00:02:42] Arfon: Yeah, I was going to say this is really cool for me to hear about this piece of software, Taylor.

I used to work at Space Telescope Science Institute in Baltimore, which is where the James Webb Space Telescope is operated from. And so it’s really nice to see such a rich sort of ecosystem of open source tools that are being developed around the Webb Telescope. That seems to be turning out to be, a defining feature of this mission as compared to earlier missions.

I guess, actually, other space missions have been good at this. I know Kepler was very strong; especially the K2 phase of the mission was really strong on open source and community software as well.

[00:03:15] Abby: Yeah, so related to that, do you have a strong background in open source?

Or is this your first time doing something open source?

[00:03:21] Taylor: So my undergrad was in physics, but I did a minor in computer science. I took about half a dozen classes or more, covering a lot of programming principles and practices and stuff like that, so I got some experience. Then around 2018 to 2020, I wrote two open source packages.

One of them falls nicely into the open source and well documented and all of those things. That was a simulation tool, a pretty basic little tool to model what an atmosphere might look like, like what an observation of a planetary atmosphere might look like. And then I also built a pipeline for Spitzer. I co-developed that.

It was open source, but again, not in that regime of actually being broadly usable by people.

[00:04:08] Arfon: I was going to ask about who your target audience is for this software. So I know a little bit about what Webb is doing and looking at the atmospheres of exoplanets. Maybe it’d be helpful to just explain a little bit about what the telescope’s doing, what type of observations it’s making, and who cares about those kinds of observations? Would you mind sharing a little bit of background? Who uses this, and why is that interesting?

[00:04:33] Taylor: For sure, yeah. Eureka! targets time series observations. So these are observations that monitor something over a length of time.

And so we’re monitoring changes in brightness. The primary use case that we’ve developed it for, which is all of our developers primary science use, is to observe exoplanets. So we look at transits, eclipses, things where we’re looking for a change in brightness of the combined system of fractions of a percent.

And we’re looking for very small features. A lot of existing institutional code is not excellent for time series observations because time series observations are very different from just taking a single snapshot of a galaxy. You have to treat the data in very different ways when you’re treating single snapshots. You really want to know how bright is every individual source absolutely. How many exact number of photons are coming from that. Whereas for time series, you want to know very precisely how that number is changing. You don’t care what the constant value is. You just care how it changes. and yeah, there’s a lot of differences there.

We built this pipeline for researchers. We’ve built it in no small part for our own use, and that’s been part of our big success: all of us developers are very actively using the code. But really, one of the driving motivations behind it was that programming ability shouldn’t define scientific success.

[00:05:59] Taylor: And so there’s a lot of, especially early career researchers and people all over the globe, that want to participate in exoplanet science but don’t have the programming ability to build a complete pipeline from scratch that is independently verified and things like that.

It’s a huge feat, and it was something that the community struggled with a lot for Hubble and Spitzer. And yeah, our target demographic really ranges from undergrad researchers working for professors, where the professor gives them a single observation, all the way to faculty and postdocs who just want to quickly crank something through.

[00:06:35] Abby: I really resonated with what you said about how the ability to code shouldn’t determine your ability to do research and how this open source code is really democratizing science in a way that we haven’t seen in a while.

So that’s exciting.

[00:06:46] Arfon: Yeah, I was going to ask something related to that. How low is that barrier? I know a little bit about where the JWST data ends up. A lot of it is in the MAST archive at Space Telescope, and I think you can get it in the European archives as well.

Can I just go download some data and run this tool? How turnkey is it for, an end user?

[00:07:04] Taylor: That’s something we’ve been struggling with in two mutually opposing ways. One is, yeah, a lot of JWST data, there’s an initial proprietary period of about 12 months or something, but then eventually it becomes public and so anyone can look at it.

Then, there is some amount of analyzed data that is archived along with those raw data, and that’s not particularly good quality at this point in time. STScI, the Space Telescope Science Institute, is working on improving that data quality. But in the meantime, it’s in a fairly rough state for time series observations.

And then for actually using Eureka! to get science data out, we’re fighting in kind of two opposing directions. One is we want it to be really easy and accessible for everyone to use. No programming, obviously, but minimal barriers. And so we’re working on making better documentation, making better templates.

So that basically you just need to specify where your inputs and outputs are and you can get a half decent first result. But something that we have learned with Hubble and Spitzer is that what works well for one observation almost never works perfectly for another observation. It might work okay, but that doesn’t mean your results are going to be ideal or even necessarily exactly right.

And so there’s some amount of fine tuned fiddling that needs to be done, and some amount of intuition that needs to be done of what do you check to make sure that what you’re getting out is reasonable. And so we don’t want people to think of Eureka! as a black box, that they just put something in, they get something out, and they’re like, okay, fine. I’ll write my paper on that.

And so one of the things that we’re doing is we’re putting out a lot of log information, but trying not to overwhelm people, just give them some heuristics of what’s going well but also plot, plot all kinds of things because humans are pretty visual creatures and then being able to just click on all the things and gain some intuition for oh, if I change this parameter, this thing changes in some desirable or undesirable way.

And yeah, just really trying to graphically demonstrate things to people.

[00:09:20] Arfon: Awesome. That’s really cool. It sounds like I probably could, as long as the data is outside the proprietary period, take the software for a spin, but you’d need to be a little careful, which I guess makes sense. It might work, but you’d want to be careful and really validate the kind of outputs you were getting before putting those results in a paper or anything.

[00:09:40] Taylor: Yeah, the way I think of it is the default values are great for a quick first look: do I have an atmosphere? Did I catch a transit? Did the observations work? And then when you want to get to publication readiness, that’s when you have to twist all those little knobs.

[00:09:53] Arfon: I actually have a side question, which is: how do you know when to look? So if I understand right, what you’re saying is, you’re looking at an exoplanet. You’re interested in atmospheres, which means you’re looking for an exoplanet as it goes in front of the star that it’s orbiting. JWST is not looking there all the time, right?

It’s instead doing other things. So how do you know when to make the observation? Is that a hard thing to figure out? Is that something that Eureka! does as well? Or is that a separate thing?

[00:10:22] Taylor: No, yeah, that’s separate. JWST, for the vast majority of its time, at least for exoplanet stuff, is not looking for anything new.

It’s not looking for new planets. It’s looking to characterize well monitored planets where we know the orbital period. Kepler is a great example: it observed a hundred transits of a planet, because the planet has an orbit of a day, so just by staring at it for a month you catch 30 transits and you know exactly when it’s going to transit. And then, doing some fancy fitting, you can try to figure out when it should eclipse, based on what you think its orbital eccentricity is and things like that.

Yeah, just a lot of Kepler’s laws, and using the Kepler Space Telescope as well as excellent ground based telescopes.
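
As a concrete illustration of the scheduling arithmetic Taylor describes, here is a small sketch, not part of Eureka!, of a linear transit ephemeris: once a reference mid-transit time and an orbital period are measured, future transits fall at the reference time plus an integer number of periods. The ephemeris values are invented for illustration.

```python
import numpy as np

# Invented ephemeris for illustration (not a real planet's values).
t0 = 2459000.5      # reference mid-transit time (BJD)
period = 1.0        # orbital period in days

def next_transits(after_time, n=5):
    """Predict the next n mid-transit times after a given epoch,
    using the linear ephemeris t_n = t0 + n * period."""
    first_epoch = int(np.ceil((after_time - t0) / period))
    epochs = first_epoch + np.arange(n)
    return t0 + epochs * period

print(next_transits(after_time=2459102.3))
```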

[00:11:09] Arfon: Cool. Awesome. Thanks for that. So I guess one of the things I’d be really curious to learn about is how this Eureka! package differs from what’s already available. I think exoplanet science is not brand new, it’s relatively recent. Are there other tools that people use, where you’re finding a gap in what was possible?

Can you say a bit more about what people might have done before Eureka! even, or what they’re doing alternatively today if people aren’t using the tool.

[00:11:34] Taylor: Yeah, so there are numerous alternatives at present for JWST data. Most of them are developed specifically for JWST data. I find that is a very good thing, especially early on in a telescope’s life, having many different people do many different things, and then we’ve worked somewhat on comparing those pipelines at the end. Because, yeah, it’s great if you have multiple things, but if that means a hundred different papers get published using a hundred different pipelines, you don’t know if you can compare planet A to planet B, because they’re different. So it’s great to have multiple pipelines, but you also need to compare them.

As for how Eureka! differs from them: many pipelines are focused on just dealing with a single instrument on JWST. JWST has four primary science instruments. You have NIRCam, NIRISS, NIRSpec, and a mid infrared detector, MIRI. And within each of them, you also have multiple different filters and different observing modes.

You have spectroscopic data, you have photometric data where you’re just taking images instead of spectra. There’s really a diverse set of possibilities, and exoplanet science uses basically all of them. Not all of the modes, obviously, but an enormous amount, and all four instruments. And at present, we’re one of the few that target all four instruments with all the exoplanet specific observing modes.

We’re currently missing the NIRISS instrument, but we just got funding earlier this week to implement NIRISS code in there. That’s one big differentiation. Another is that it is community developed and community supported. There are about a dozen of us who have contributed large parts, but we get issues submitted through GitHub almost weekly, if not more often, which allows us to help support people.

A lot of pipelines that just exist, yes, they might be hosted on GitHub, but it’s hard to get anyone to respond. Whereas there are many of us who try to work with people and help them get up and running. And compared to what existed before JWST, it’s complex with all of these different instruments, but in some ways some of the instruments are quite like Hubble, and some of the data is like Spitzer.

And this pipeline is kind of Frankensteined together from bits of Hubble and Spitzer code that was previously written by Laura, Kevin, and me, as well as some other people. And so it’s drawing all of that together. And at present, we allow Hubble and JWST data to be fitted with Eureka!. It’s my personal aspiration that someday we might add Spitzer, but Spitzer is now decommissioned, so it would mostly be for archival research, and it takes a lot of energy to build something new for something that’s just archival research.

The Space Telescope Science Institute also has its own software to support observers. One of its big weaknesses, in my mind, is the lack of plots. There are no plots made. You can save intermediate results, but a lot of people just want to run the pipeline and not have to interject a bunch of things to make plots, or force-save every intermediate result and then plot it after.

And so that’s one big thing that we’re doing: really making visible all of the intermediate results. And then we also, we call it an end-to-end pipeline, where we go from raw pixels to planetary spectra. And in truth, you can go slightly further, turning planetary spectra into, okay, what’s that atmosphere made of? Is there carbon? Is there carbon monoxide? Stuff like that.

But we do go largely end to end, which is, again, not something that the STScI pipeline does, and something that is often done by other community pipelines.

[00:15:44] Abby: That’s amazing. First off, congrats on the funding. That’s always huge to get that, especially in the research space.

I also really resonated with what you were saying about being community built I do think a lot of the power of open source comes from building with the community and with others. So how are you getting the word out to the community? Is this why you published in JOSS, the Journal of Open Source Software?

What else are you doing to bring the community together?

[00:16:05] Taylor: Yeah, it was definitely a large part of why we published in JOSS: A) to make sure people heard about us, and to have a nice, clean way to publicize the pipeline and its state at that point. We also were quite involved in early release science.

JWST came with Early Release Science observations, which were community proposed observations that would test JWST in many of the key science cases and collect some open public data for the community to learn on and do a lot of these pipeline comparisons.

And I think every ERS, Early Release Science, paper had Eureka! on it. That’s an advantage of having a dozen developers on it and also just making it open source: several of the analyses that have been published so far are not by Eureka! developers. We’ve also been working on leading tutorial sessions and doing workshops, like at the Sagan Summer Workshop held at Caltech this past summer.

We held two hands on sessions, training people how to go from pixels to light curves, and light curves to planetary spectra. And that was a free conference to attend, either in person or virtually. and so we had hundreds of people participating in that, primarily early researchers, but also some faculty and stuff like that.

And so that Sagan Summer Workshop covered the Eureka! pipeline which takes us from pixels to planetary spectra, and then two different methods of going from planetary spectra to atmospheric compositions. And yeah, just doing a lot of kind of outreach like that and then also just letting our work speak for itself. We’re comparable to a lot of pipelines, we’re easy to use. Encouraging people to check them.

[00:17:58] Abby: Yeah, that’s great. And I do love the workshop method for getting people on board. Are you finding that a lot of these users become contributors eventually or start contributing in small ways?

[00:18:09] Taylor: That hasn’t yet happened. It’s, still very early days.

We haven’t even released version 1.0. We’re on 0.10, which I think we just released. We have had community members contributing code. But that workshop was only a couple of months ago. We do get issues submitted by the workshop attendees and others, which helps. But primarily it’s been us developers solving issues for people at this point, and fewer people contributing their own solutions.

[00:18:37] Abby: Right. I do think that issues are a valid form of contribution to open source. That’s very important.

[00:18:43] Taylor: Good point.

[00:18:43] Arfon: I was actually going to say, Abby and I probably spend a lot of our lives doing this. Maybe you do as well, Taylor, but when you go and look at an open source project, you can get a sense just by clicking around if stuff’s happening: are there issues being opened and closed, are there pull requests being opened and merged. And Eureka! looks great. There seems to be tons of activity on the project, which looks fun, and it looks like it’s really a healthy project.

Lots of people opening issues, lots open, but lots have also been closed. It’s always a good sign to see that. Yeah, congrats on a really nice project.

I wanted to ask a little bit about the review process that you went through. So just to explain very briefly, this of course, Taylor, the JOSS review process is an open one.

It happens on a GitHub issue. A couple of people come and go through a checklist driven review process and then ask questions and suggest improvements. I know it was a bit over a year ago, so I know it’s a little while ago, but I wondered if you had any things you wanted to share about that review. Did it help you as authors of the software? What was your kind of lasting experience from that?

[00:19:45] Taylor: Before Eureka!, all the pipelines that I’ve written, and all the pipelines that I can immediately think of, were never published in themselves. They were always like, I’ve got these new Spitzer data, so I wrote a pipeline. I tersely write what the pipeline does in the paper, but really it’s a science paper. So as a result, the referees of that paper really don’t focus that much on the pipeline. It’s left to the developer of that pipeline to cross-validate it with other pipelines and make sure that they didn’t drop a square root of two anywhere or things like that. With JOSS, it’s still on you to not drop square roots of two and stuff, but it was much more focused on the pipeline and on the documentation, and making sure that there’s an easy way to install the pipeline, and tutorials and things like that.

Which was significantly beneficial to the pipeline. It was great to really have the spotlight on the pipeline itself.

[00:20:50] Arfon: Typically you’d have astronomers reviewing a piece of astronomy pipeline. These people could be users, I guess, or proto-users. You could imagine the review might touch on code quality, documentation, installation. Was there a particular area where you were like, oh yeah, this significantly improved some outcome here, or was it just a fairly broad analysis by the reviewers?

[00:21:08] Taylor: I’d say the area in which we most improved as a result of the JOSS review is documentation. It was something we had spent a significant amount of time on, writing documentation post facto, which is always the wrong way to do things, because it’s so miserable to read other people’s code.

Again, most of this was written for Hubble and Spitzer many years ago, so even the people who wrote it don’t necessarily remember what they wrote. I spent a lot of time writing documentation and trying to make it uniform in preparation for the JOSS review, and then the JOSS reviewers pointed out even more places where it could be improved, which definitely made things a lot better.

I’ll admit we still have a folder of old functions that were written initially in IDL and then converted to Python, but the comments never changed. But we have 30,000 lines of code, and at least at the start we weren’t being paid for developing this, so it was us donating our time.

Now that we have some money, we can invest more time in this, but again, with 30,000 lines of code you can only do so much. We really focused on the user accessible layers of things, and a lot of the really deep stuff that’s five, six functions deep is where you start getting a lot of computation.

But the documentation and the way of installing the pipeline were definitely improved.

[00:22:32] Abby: I guess related to that, this does sound like it’s one of the bigger open source projects and definitely one of the more community focused open source projects you’ve ever contributed to. And I know the JOSS review process is really focused on getting that stability or that groundwork you need to build that flourishing community.

Any big surprises come your way working open source or working openly?

[00:22:51] Taylor: I think the initial surprise that I had was at how functional we were with a dozen people working on the code. There are so many opportunities for people trying to do the same thing, overlapping, and then you have nasty merge conflicts and stuff.

But, yeah, at the start the core dozen or so of us would have a Zoom meeting every week or two weeks to divide up some tasks. But then I’ve been surprised at how well things have kept running. It hasn’t fizzled out, because that’s so easy to happen. A lot of projects start and then people are kind of like, eh, good enough.

We definitely have fewer contributions now because the pipeline is getting to a good state. I mean, a better state. But there’s still a lot of interest from many of the members, and the community interest is ramping up as time goes on, and we’re starting to see more and more pull requests, which, yeah, takes the burden off us too. If people have, say, an observation of two planets instead of one and they have to do some special stuff to the code, they just contribute that, rather than us having to build something from scratch. People are doing this for themselves, and then they have been very willing to just offer that back to the community as an in-kind donation, which is awesome.

[00:24:19] Abby: I love that. The piece that surprised you is just how well collaboration and open source is working. It’s perfect.

[00:24:25] Arfon: That’s nice. Yeah.

I wonder if I could take us back to Webb and the science just for a moment. I was curious, the software’s existed for a while. I think you mentioned that you built Eureka! to be ready for the early release science program.

As a bit of a space nerd myself, I was curious, are there any results that you’re particularly excited about from JWST? Related to Eureka! or not, are there any things that are really memorable for you, or things that you’re anticipating would be really exciting discoveries that you’re hoping to see in the not too distant future?

[00:24:59] Taylor: Yeah, I think one was all of the early release science observations. We had observations with all four instruments. I was just stunned by the beauty of the data and the remarkable quality of it. I’ve talked a lot about Hubble and Spitzer and both of them, their raw data was kind of ugly for exoplanet science.

There’s just a lot of noise and a lot of dealing with 1 percent systematics to get to a 0.01 percent astrophysical signal. Whereas here we’re dealing with noise that’s at or below the astrophysical signal. And so the astrophysical stuff just pops up.

One example is the carbon dioxide detection on WASP-39. This was the first paper put out by the ERS team. And that’s just a great example of running it through the first time, you didn’t fine-tune everything, and then boom, you see this feature that we’ve been looking for for a long time.

That was super exciting. More recently, I just had a paper come out in Nature where we used Eureka! and another pipeline to find methane in an exoplanet atmosphere. And so this is something that we’ve been looking for for a long time in transiting exoplanets. But Spitzer was photometric, and its filters were very broad, so it averages together a lot of different wavelengths into a single point.

And so methane would be mixed with water, and carbon monoxide, carbon dioxide, and there’s a whole big mess of things, and so really saying that there’s methane in the planetary atmosphere has been very tough, but with that paper, yeah, you could just see the bump and divot caused by methane, and yeah, that’s been really exciting too.

[00:26:44] Arfon: So is it just the higher resolution spectra, is that why you can see it? It’s not convolved with some other signature?

[00:26:51] Taylor: Higher resolution spectra and just higher signal. JWST is just so much larger.

[00:26:56] Arfon: Bigger mirror!

[00:26:57] Taylor: Big, bigger mirror, better resolution, and vastly smaller systematic noise components all together make it possible.

[00:27:07] Abby: That’s great. So it does sound like this project is open for contributions. What kind of skills do I need to jump in?

[00:27:14] Taylor: So it’s all written in Python. Yeah, we’ve considered doing some Cython to make some core things faster, but as of right now everything’s Python. So it’s primarily just Python and GitHub.

Knowing how to use those two is basically all you need to contribute. There’s open source data that you can mess around with, to try things out and see what works for you and what doesn’t. We do have a community policy to make sure that your contributions are appreciated and respected, and that everyone else is also respected.

Yeah, it’s, basically GitHub, Python and then some time.

[00:27:51] Abby: That’s great. And is any scientific knowledge needed?

[00:27:54] Taylor: We try to be somewhat pedagogical with some of the stuff, especially the first four stages. Eureka! is broken into six stages. The first two stages are really just dealing with nitty-gritty, JWST specific pixel stuff.

And for that we rely on the software built by STScI, their JWST pipeline. But then we tweak it in some ways. Then the middle stages are data analysis. They’re not really science focused. It’s like how do you go from pixels to a spectrum.

And then stage five is where you get into a little bit of science: what does an exoplanet transit look like? And there are great open source packages that we’ve been using for that. And then stage six is just stapling together a hundred different wavelengths, just putting them all into one section. So yeah, fairly minimal requirements for learning the science behind it.
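
To picture that staged structure, here is a toy, runnable sketch of a pixels-to-spectrum flow. None of the function names or stage contents are Eureka!’s actual code or API; they are stand-ins meant only to show how each stage consumes the previous stage’s output.

```python
import numpy as np

BIAS = 10.0                              # pretend detector bias level (invented)

def calibrate(raw):                      # stages 1-2 stand-in: detector-level cleanup
    return raw - BIAS

def extract_spectra(frames):             # stage 3 stand-in: collapse each frame to a 1D spectrum
    return frames.sum(axis=1)            # sum along the spatial axis

def make_light_curves(spectra):          # stage 4 stand-in: one light curve per wavelength column
    return spectra / np.median(spectra, axis=0)

def fit_depths(curves, in_transit):      # stage 5 stand-in: crude per-wavelength "fit"
    return 1.0 - curves[in_transit].mean(axis=0) / curves[~in_transit].mean(axis=0)

# Stage 6: the per-wavelength depths together form the planetary spectrum.
rng = np.random.default_rng(0)
signal = np.full((100, 32, 16), 1000.0)              # (time, space, wavelength) toy cube
in_transit = np.zeros(100, dtype=bool)
in_transit[40:60] = True
signal[in_transit] *= 0.99                           # inject a 1% transit
raw = signal + BIAS + rng.normal(0.0, 1.0, signal.shape)

spectrum = fit_depths(make_light_curves(extract_spectra(calibrate(raw))), in_transit)
print(np.round(spectrum, 4))                         # roughly 0.01 at every wavelength
```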

[00:28:47] Arfon: Cool.

I think I’ve stared at bits of the JWST pipeline in the past, so maybe I’ll find some of that code in there too. So what’s the best way that people can stay up to date with your work, Taylor? It sounds like you had a Nature paper recently, so congrats on that. But how can people find you online and keep up to date with Eureka! and its latest releases?

What’s the best way to for folks to keep track?

[00:29:08] Taylor: Probably the best way is my personal website, www.taylorbell.ca. I have all my socials and stuff linked on there, and I try to keep it up to date. There are some blog posts of some science I’ve done, and many of them are just placeholders where I’ll eventually write stuff, but yeah. That’s a good place to keep track of me and find my GitHub and things like that, too.

[00:29:32] Arfon: Cool. Thanks. And future releases of Eureka!? Is GitHub the right place to keep track of that as well?

Is that where you would release new versions?

[00:29:40] Taylor: Yeah, yeah, we just released version 0.10 probably two weeks ago, and we’re working actively on version one right now, where we won’t really add any new features, but we’ll make the maintenance a lot more sustainable for us, reorganize things, and do some backwards compatibility breaking to make things easier for users and developers.

[00:30:00] Abby: Nice. So that’s on GitHub. That’s taylorbell57, I believe. We’ll link all this in the show notes so people can go find those. But thank you so much, Taylor, for joining us. This is a great conversation.

[00:30:10] Taylor: Thanks so much for having me.

[00:30:12] Abby: Of course.

[00:30:12] Arfon: Yeah, and congrats on a really nice piece of software. It looks really great. I wish you all the best, and I hope there are many future releases. So thanks again for telling us about your package.

[/expand]

Video

JOSSCast #1: Eva Maxfield Brown on Speakerbox – Open Source Speaker Identification for Political Science

Subscribe Now: Apple, Spotify, YouTube, RSS

In the first episode of Open Source for Researchers, hosts Arfon and Abby sit down with Eva Maxfield Brown to discuss Speakerbox, an open source speaker identification tool.

Originally part of the Council Data Project, Speakerbox was used to train models to identify city council members speaking in transcripts, starting with cities like Seattle. Speakerbox can run on your laptop, making this a cost-effective solution for many civic hackers, citizen scientists, and now... podcasters!

From the advantages of fine-tuning pre-trained models for personalized speaker identification to the concept of few-shot learning, Eva walks us through her solution. Want to know how this open source project came to life? Tune in to hear about Eva’s journey with Speakerbox and publishing in JOSS!

Transcript

[00:00:00] Arfon: Welcome to Open Source for Researchers, a podcast showcasing open source software built by researchers for researchers.

My name is Arfon.

[00:00:11] Abby: And I’m Abby.

[00:00:12] Arfon: And we’re your hosts. Every week, we interview an author who’s published in the Journal of Open Source Software. This week, we’re going to be talking to Eva Maxfield Brown about their software, Speakerbox, few shot learning for speaker identification with transformers.

[00:00:27] Abby: Yeah. And I think this was a great interview to kick off with. Eva was really excited to talk about her work. And I thought it was very applicable for podcasting!

[00:00:35] Arfon: Absolutely. And, I don’t think I said it at the start, but this is our first ever episode, so we’re really excited to start with this really interesting piece of software.

[00:00:44] Abby: Awesome, I guess we can just dive right in.

[00:00:46] Arfon: Yeah, let’s do it.

Welcome to the podcast, Eva.

[00:00:49] Eva: Hi, how are you?

[00:00:50] Arfon: Great. Thanks for coming on.

[expand]

[00:00:52] Abby: Yeah, glad to have you in one of our early episodes.

But do you want to tell us a little bit about yourself just to kick us off?

[00:00:58] Eva: Yeah, sure. My name is Eva Maxfield Brown. I’m a PhD student at the University of Washington Information School. My research primarily focuses on the science of scientific software more generally, but I also like to believe that I practice building scientific software. So I have experience building software for computational biology or microscopy and also political and information science.

The project that we’ll be talking about today falls into that political and information science capacity, of building and not just studying.

[00:01:27] Arfon: Awesome.

Tell us about Speakerbox. Tell us about the project. This is a paper that was reviewed in JOSS about a year ago now, a little bit over, or initially submitted. I’m just curious, if you could tell us a little bit about the project, why you started it, and what kinds of problems you’re trying to solve with Speakerbox.

[00:01:43] Eva: Yeah. I’ll have to back up for a little bit. Speakerbox is part of a larger ecosystem of tools that fall under this project that we were calling Council Data Project. This was a system that basically was like, Oh, what’s really hard for political scientists is studying local government and one of the reasons why it’s hard to study local government is most of the stuff you have to do is qualitative.

It’s very hard to get transcripts. It’s very hard to get audio. It’s very hard to get all these different things in a standardized format for all these different councils. So, Council Data Project started a long time ago and it tried to lay the groundwork of getting all those transcripts and systematizing it in a way.

But one of the longest requested features of that project was being able to say that each sentence in a transcript was said by who, like this sentence was said by Council Member A and this sentence was said by Council Member B and so on. And so there’s a problem here, right?

We could spend the time and resources to build and train individual speaker identification models for each city council. But that’s a lot of time and resources that we don’t really have access to both as researchers, but also just as people interested in contributing to the local government area.

And so Speakerbox really came out of that problem of how do you quickly annotate and train individual speaker identification models for application across a transcript, across a timestamped transcript.
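
To make that application step concrete, here is a rough sketch of labeling a timestamped transcript with a speaker model. The classifier below is a placeholder stub, not Speakerbox’s actual API; the transcript, sample rate, and audio are toy values invented for illustration.

```python
import numpy as np

def predict_speaker(audio_segment, sample_rate):
    # Placeholder for a trained per-council speaker model; a real model
    # would embed the audio segment and classify it. This stub just
    # returns a fixed label so the sketch runs end to end.
    return "Councilmember A"

def label_transcript(waveform, sample_rate, sentences):
    """Attach a predicted speaker to each timestamped sentence.

    `sentences` is a list of dicts with 'start', 'end' (seconds) and 'text'.
    """
    labeled = []
    for s in sentences:
        start = int(s["start"] * sample_rate)
        end = int(s["end"] * sample_rate)
        segment = waveform[start:end]
        labeled.append({**s, "speaker": predict_speaker(segment, sample_rate)})
    return labeled

# Toy example: ten seconds of silence standing in for meeting audio.
sr = 16_000
audio = np.zeros(10 * sr, dtype=np.float32)
transcript = [
    {"start": 0.0, "end": 4.2, "text": "Calling this meeting to order."},
    {"start": 4.2, "end": 9.5, "text": "First item on the agenda is public comment."},
]
for row in label_transcript(audio, sr, transcript):
    print(row["speaker"], "->", row["text"])
```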

[00:03:00] Abby: Yeah, reading this paper I got really excited as a nascent podcaster. So I could see how this is immediately applicable, because I have these audio files, and can we identify who’s speaking? Anyways, I could see how this could be useful.

[00:03:14] Arfon: Yeah, maybe we could, maybe we should try it

[00:03:17] Abby: Yeah, we can. On this episode.

[00:03:19] Eva: I think one of the early GitHub issues that we got after publishing the paper was from someone who was recording their research lab’s meetings and basically wanting to train a model just to tag their own lab members. And I was like, I guess that’s a use case. Sure. That makes sense.

[00:03:35] Abby: We can train it to tag ourselves as hosts and identify a mysterious third person each episode.

[00:03:41] Arfon: Yeah, there you go

I was going to say, so, the Council Data Project, could you say a bit more about who the users are, who the target audience here is? Is that like studying government and how councils work? Who is your audience?

Who would use Speakerbox?

[00:03:55] Eva: Yeah. So that’s a really good question. So I would say the larger ecosystem of Council Data Project would fall into political science, especially the subdomain of urban or city level, municipal level political scholarship. Then there’s also information science questions.

And I think there’s a lot of really interesting information science questions, such as how do ideas and misinformation and disinformation enter city councils and so on. So I think those are two very core audiences of that research, but there’s also the public side.

So there’s just general users, right? Council Data Project comes with a website and a searchable index that you can use to search through all the meetings that we’ve transcribed and everything. So we do have general citizen users. We also have people at the cities that we support, like city employees, as users as well. And journalists.

So there are a couple of different groups that are interested in it from the Council Data Project angle. But specifically for Speakerbox, as we just talked about, I think there are a number of cases where this is just useful. So that was one of the reasons why we split it out as its own thing. Certain tools for Council Data Project will likely be hard coded towards our infrastructure or towards our technology stack. But Speakerbox, as just this general toolbox for quickly training a speaker identification model, has a broader application set.

So things like podcasting, things still in the domain of journalism or in the domain of political science, sure, but also in music or in anything else that kind of deals with this very fast audio labeling type of dilemma.

[00:05:21] Abby: Yeah, another thing that really struck me reading the paper and also just hearing you talk about the Council Data Project plus Speakerbox. There’s a lot of open source and open data tools and resources that you’re using and you’re drawing from.

Did that really affect why you’re doing this openly? Can you just tell me about the whole ecosystem?

[00:05:38] Eva: Yeah, that’s a good question. I think from the start we positioned Council Data Project in the genre or domain of that prior era of open data and civic collaboration. We originally envisioned that a different group would be able to deploy all of our tools and technology for their city council, wherever they are.

That would just reduce the cost on us. And so that was partly the reason why moving to the open source model was really nice. All the documentation can be available; if they need to make an edit for their own city, whatever it is, they can contribute back. There are also existing projects prior to us in the history of this civic information landscape that followed this pattern. But I think we’ve shifted it to our own needs a little bit. That said, the pace of machine learning and natural language processing tools is just so fast that you can expect a new version of some package to come out and there’s going to be a breaking change.

And to be honest, at this point, I work on a lot of different projects and it’s really nice to push it open source just because I know that there are other people, whether they’re related to Council Data Project or not, that are also going to use it and need it and can contribute to helping fix and maintain and sustain the project itself.

[00:06:47] Arfon: So are there really obvious alternatives? I understand why you would want to publish this, and there’s lots of opportunity for pushing things into the public domain, but are there tools that other people were using previously that you’ve found efficient, or that are maybe using, I don’t know, proprietary languages or something?

What were the motivations for publishing a new tool specifically?

[00:07:07] Eva: I think there definitely are proprietary technologies that allow you to do this. Google is probably the most famous; I think they have a whole suite of “give us your data set and we will try to train a model for it” services.

Other companies as well. I think Microsoft probably has their own version of that, Amazon as well. So there are definitely proprietary versions of this. What I particularly wanted to do was, one, focus specifically on our core user base, which is that civic technology type of person. They don’t typically have access to a lot of money to pay Google or Amazon or whatever, but they might have access to their laptop with a decent CPU, or they might even have a desktop with a GPU or something.

And so I wanted to make it possible to quickly annotate, to demonstrate exactly how to annotate this data, and also to demonstrate training on a consumer-grade GPU. I think the GPU that I used in the example documentation for all of this is five or six years old at this point.

And show that it still works totally fine and it trains in under an hour. Everything is able to be done in, let’s say, a weekend of time. And I think that was really needed. I’ve already heard from other people, whether it’s other research labs or other people trying to deploy all of our Council Data Project infrastructure, that they are able to quickly iterate and train these models just on their own local machine.

[00:08:33] Arfon: Yeah, that’s really cool. You mentioned civic technology, I think I heard you say civic hacking space before. I think that the idea of publishing tools for communities that don’t have tons of resources and aren’t necessarily able to pay for services is really important.

And I was fortunate to be in Chicago when the civic hacking scene was really starting up in the late

[00:08:54] Eva: Oh,

[00:08:55] Arfon: 2000s. Yeah, about 2009, 2010, and that was a really vibrant community, and I think still is, and there’s just lots

[00:09:01] Eva: Yeah. Chi Hack Night is

[00:09:03] Arfon: Chi Hack Night, there you go. Yeah, lots of open civic tech. It’s a really cool space. Yeah, cool, okay.

Okay, so I can run this tool on a relatively modest GPU, but what’s actually happening under the hood?

So there’s this transformer thing, there’s some learning that happens. What does it look like to actually train Speakerbox for your project, for your dataset?

[00:09:27] Eva: Yeah. Fortunately for us, Transformers is, I think, two things. When people say the word Transformers, it’s a couple of things all in one. There’s Transformers the architecture, which I think we can maybe shove aside and say is the, quote, foundation of many modern machine learning and natural language processing models.

It’s very good at what it does, and it works based off of trying to look for patterns across the entire sequence, right? So looking at whole sequences of data and saying, where are these things similar? So that’s great. But there’s also Transformers, the larger Python package and ecosystem, which is run by a company called Hugging Face.

And that’s, I think, what most people now honestly think of when they say, oh, I’m using Transformers. Specifically in my case, and in the Speakerbox case, we built it off of Transformers because there are tons of existing models that we can use to make this process of speaker identification or speaker classification much, much easier.

Specifically, there are models already pre-trained and available on, I think, the VoxCeleb data set. But the general idea is that people have already trained a transformer model to identify something like 1,000 different people’s voices.

And to my understanding, they’re all celebrities, right? So you can imagine this training process as: okay, given a bunch of different audio samples of celebrities’ voices, can we identify which celebrities’ voices these are? In our case, we don’t care about the celebrities, right? If you want to train a model for your lab, or you want to train it for a city council meeting or whatever it is, you care about the people in your meeting.

And so the process there is called fine-tuning, where really the main task is that you’re taking off that last layer of the model and, instead of saying I want you to pick one out of a thousand celebrities, you’re saying, in this call, I want you to pick one out of three people, right?

And by doing so, you can quickly give it some extra samples of data. So we can give it, let’s say, 20 samples of Arfon speaking. We can give it 20 samples of Abby speaking. We can give it 20 samples of me speaking. And hopefully, just because of the existing pre-training, it’s learned the basics of how people speak and how people converse.

And all we’re really doing is saying: don’t worry about the basics of how people speak and how people converse. Really focus on the qualities of this person’s voice, and the qualities of that person’s voice, and so on. Hopefully that answers the question.
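
To make that head-swapping idea concrete, here is a minimal sketch of the fine-tuning setup with the Hugging Face Transformers library. It is not Speakerbox’s actual code; the checkpoint name and the three speaker labels are illustrative assumptions.

```python
# A minimal sketch of the fine-tuning setup described above, NOT Speakerbox's
# actual implementation. The checkpoint name and speaker labels are
# illustrative assumptions.
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

speakers = ["abby", "arfon", "eva"]  # the handful of people we actually care about
checkpoint = "superb/wav2vec2-base-superb-sid"  # a speaker-ID model pre-trained on VoxCeleb celebrities

feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = AutoModelForAudioClassification.from_pretrained(
    checkpoint,
    num_labels=len(speakers),  # swap the many-way celebrity head for a 3-way head
    label2id={name: i for i, name in enumerate(speakers)},
    id2label={i: name for i, name in enumerate(speakers)},
    ignore_mismatched_sizes=True,  # the original classification layer is discarded
)
# From here, fine-tuning on ~20 short labeled clips per speaker is a standard
# transformers Trainer run over the new classification head.
```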

[00:11:50] Arfon: Is that what we mean by few-shot learning? So you give a few examples, effectively, labeled “this is Abby speaking” or “this is Eva speaking”. Is that what we mean by few-shot?

[00:12:03] Eva: Exactly. In our case we’re doing a little bit of a hack. We say few-shot learning, but it’s not really, because it’s very common for few-shot learning to literally mean five examples, or something less than ten. But because audio samples can get really long, like I’m answering this question in a one-minute response, we can split that audio up into many smaller chunks that act as potentially better, smaller examples.

And so you may give us five examples, but in the process of actually training the model in Speakerbox, we might split that up into many smaller ones. So we’re kind of cheating, but,
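
As an aside, the chunking trick Eva describes can be sketched in a few lines. This is a toy illustration rather than Speakerbox’s code, and the file name, label, and chunk length are made-up examples.

```python
# Toy illustration of turning one long, labeled clip into many short training
# examples. Not Speakerbox's implementation; the file name, label, and chunk
# length are made-up examples.
from pydub import AudioSegment  # assumes pydub (and ffmpeg) are installed

def chunk_clip(path: str, label: str, chunk_ms: int = 2000) -> list[tuple[AudioSegment, str]]:
    """Split one labeled recording into fixed-length labeled chunks."""
    audio = AudioSegment.from_file(path)
    return [
        (audio[start:start + chunk_ms], label)  # pydub slices by milliseconds
        for start in range(0, len(audio) - chunk_ms + 1, chunk_ms)
    ]

# A roughly one-minute answer from one speaker becomes ~30 two-second
# examples, so a handful of "shots" per person quickly adds up.
examples = chunk_clip("eva_answer.wav", "eva")
```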

[00:12:38] Arfon: Interesting. Thanks.

[00:12:40] Abby: Yeah, that’s awesome. And I do really like the Hugging Face platform, where there are so many models that you can tweak. Are you sharing back to the Hugging Face platform at all? Or,

[00:12:50] Eva: I haven’t, no. We have used some of the models trained with Speakerbox in our own work with Council Data Project, specifically for the Seattle City Council, because that’s my home council. We’ve trained a model to identify city council members for Seattle, and we’ve applied it across, I think, almost 300 transcripts.

But that was purely for analysis. We haven’t pushed that model up or anything, and we probably should; that should probably even be part of the pipeline. Yeah, that’s still to come. I’m sure other people who may have used the system might have pushed models up or something, but,

[00:13:20] Abby: You did publish in JOSS, the Journal of Open Source Software. Can you talk a bit about why you decided to publish there, and how the experience was?

[00:13:27] Eva: Yeah. So to be honest, I love JOSS. This is my second article in JOSS; we’ve also published one for the larger ecosystem of Council Data Project as well. I think after the first time publishing with JOSS, I was just really, really impressed with the reviewing process. More so than with some other journals, right?

I think the requirement of having reviewers try to use the package that you’re pushing, saying this is ready for use or this is useful in X domain, is such a good thing to do. Because, one, we know that it’s available, we know it’s documented, but we also know that it works. And I’ve just been frustrated by the number of times I’ve seen a paper on arXiv or something that links to their code, but there’s no real documentation or anything.

And so, to me, I just like to keep supporting what JOSS does, because I really appreciate it. I’ve read a number of papers from JOSS where I’m like, that tool seems great, I’m going to go use it, right? It just seems like the default place in my mind now: if I want to come across as a trusted developer of software, or a trusted developer of research software, that’s the place to go.

[00:14:35] Arfon: Yeah, and it’s funny you say having people actually use your software. This is one of the founding questions that a few of us were asking. It was possible to publish a paper about software in some disciplines, but the question I was asking people on Twitter at the time (I’m not really very active there now) was: when you have managed to publish a paper about software, did anybody look at the software?

And people were like, nope. Universally, nobody ever said yes. And I was like, huh, that seems really weird to me, that you would publish software but nobody would look at the software. I’m glad that resonates for you and provides value.

[00:15:17] Eva: I was gonna add, I think the open review process is very, very nice. Because at least when I was writing the first paper that we wrote for JOSS, I could go and look at other review processes and say, okay, what are they expecting? What might we need to change ahead of time, et cetera. And I could just prepare myself. Especially as, now, a PhD student, sometimes publishing is scary, and being able to look at a review process drastically lowers that concern or that worry as well.

[00:15:47] Arfon: Yeah, definitely. I mean, it’s good and bad with open review. It’s good for transparency, and people are generally very well behaved. It’s bad because it can be, I think, amongst other things, intimidating; it could be exclusionary to some folks who don’t feel like they could participate in that kind of review.

But on balance, I feel like it’s the right decision for this journal. I definitely think it’s an important part of what JOSS does.

I was going to ask, it sounds like you’d eyeballed and seen some reviews before you submitted, but was there anything particularly important about the review for Speakerbox, a particularly good contribution from the reviewers, anything surprising or particularly memorable from that review?

[00:16:33] Eva: Speakerbox, because it’s designed to act in a workflow manner, isn’t so much a library of here’s a single function, or here’s a nice set of two or three functions to use for your processing. It’s very much designed as a workflow, right?

Where you say, I have some set of data, I need to annotate that data first, then I need to prep it for training, then I finally need to train the model, and I might need to evaluate as well. And because it’s laid out in that workflow kind of style, the reviewers were having trouble treating it like a normal package, where you say, I’m just going to go to the readme, look at a quick start, and use it.

And I think that, honestly, the biggest piece was that the reviewers just kept saying, I’m a little confused as to how to actually use this. Ultimately I very quickly made a video demonstrating exactly how to use it, and I think the best contribution was just going back and forth and saying, okay, I see your confusion here.

I see your confusion here. And that ultimately led me to say, okay, I’m going to take everything that they just said, make a video, and also take the things that I did in the video and add them to the readme, right?

By literally forcing myself to do it on a toy data set, I could add all the little extra bits, the places that were possibly confusing for them as well. Using my own data versus using some random other data, there are always going to be things that you forget to add to the readme or whatever it is.

[00:17:46] Abby: That’s awesome. And one thing I do like about the JOSS review system is how it adds so many best practices for open source and makes it so much easier for others to run it and to contribute. So is Speakerbox open for contribution if people wanted to jump in? Can they?

[00:18:02] Eva: Yeah, totally. Speakerbox, yes. I think there are some existing minor little bugs available to pick up. There are also probably tons of optimizations or just UX improvements that could definitely happen. Totally open for contribution, totally happy to review PRs or respond to bugs and stuff.

[00:18:20] Abby: That’s great, especially since you published this a while ago, that you’re still being responsive there on GitHub. But what skills do people need if they want to jump in?

[00:18:28] Eva: I think there are maybe two areas. One is a very good background in, let’s say, data processing and machine learning in Python; those two things together as a single trait would be nice. Or, on the opposite end of things, you can just try to train your own model, just use the system as is. You might experience bugs or something, and you can just make a documentation post, like saying "I experienced this bug, here’s how I got around it", or post a GitHub issue saying there’s this current bug, for someone else to pick up. I think truly trying to fix stuff if you have a machine learning background, or just trying to build a model, are both totally welcome things.

[00:19:06] Abby: That’s awesome. Yeah, and I love how you’re seeing using the model as a contribution, and just documenting what you run into.

[00:19:12] Arfon: So I’m gonna put my product manager hat on for a minute and ask you: what questions should we have asked you today that we haven’t yet?

[00:19:21] Eva: Woo.

[00:19:21] Arfon: It’s okay if there isn’t a question, but it’s the magic wand question: what could this product do if it could do anything? What should we have asked you as hosts that we didn’t check in with you about today?

[00:19:35] Eva: I think it’s not so much a question, maybe more: what’s the feature that you dream of?

[00:19:39] Arfon: Yeah,

[00:19:40] Eva: If it’s truly the product manager hat on, then I think the feature, and we’ve previously talked a lot about this, is that the workflow itself is pretty simple, and there are pretty good packages nowadays for shipping a webpage via Python. It’d be very nice to have the entire workflow as a nice user experience: you launch the server from your terminal and then you can do everything in the webpage, especially for the users that aren’t as comfortable in the terminal, right?

That would be very, very nice, and something that I just never got around to; it’s not my priority. But it’s something that we’ve thought about, as it would definitely reduce the barrier of use or the barrier of effort or whatever.
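
For a sense of what that could look like, here is a purely hypothetical sketch of serving the workflow as a local web page from Python with Gradio. This is not an existing Speakerbox feature, and the handler below is a placeholder.

```python
# Purely hypothetical sketch of the dreamed-of web UI; NOT an existing
# Speakerbox feature. The handler below is a placeholder.
import gradio as gr

def identify_speakers(audio_path: str) -> str:
    # A real version would run annotation, training, or inference here.
    return f"Received {audio_path}; model output would appear here."

demo = gr.Interface(
    fn=identify_speakers,
    inputs=gr.Audio(type="filepath"),  # upload or record audio in the browser
    outputs="text",
)
demo.launch()  # one command in the terminal, then everything happens in the webpage
```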

[00:20:21] Abby: Yeah. Oh, that’s really cool. Especially in the civic tech space, I can see that being like really game changing for a lot of cities.

[00:20:28] Arfon: Yeah.

[00:20:28] Eva: Yeah. I think civics and journalists as well. But yeah.

[00:20:32] Arfon: Yeah. Very cool. Cool.

Hey, Eva, this has been a great conversation. Thanks for telling us about Speakerbox. Just to close this out, I was curious: how can people follow your work and keep up to date with what you’re working on? Is there any place you would want people to go to keep track of what you’re working on?

[00:20:49] Eva: Yeah, my website is evamaxfield.github.io. I will occasionally post updates on papers and stuff there. I think Twitter might be the best place; you can find me on Twitter at @EvaMaxfieldB as well.

[00:21:02] Arfon: Fantastic. Thanks for being part of the podcast. It’s been really fun to learn about your software, and thanks for your time today.

[00:21:09] Eva: Yeah, thanks for having me.

[00:21:10] Abby: Thank you so much for listening to Open Source for Researchers. We love to showcase open source software built by researchers for researchers, so you can hear more by subscribing in your favorite podcast app. Open Source for Researchers is produced and hosted by Arfon Smith and me, Abby Cabunoc Mayes, edited by Abby, and the music is CC-BY Boxcat Games.

[/expand]


Introducing JOSSCast: Open Source for Researchers 🎉

Subscribe Now: Apple, Spotify, YouTube, RSS

We’re thrilled to announce the launch of “JOSSCast: Open Source for Researchers” - a podcast exploring new ways open source can accelerate your work. Hosted by Arfon Smith and Abby Cabunoc Mayes, each episode features an interview with different authors of published papers in JOSS.

There are 3 episodes available for you to listen to today! These include “#1: Eva Maxfield Brown on Speakerbox – Open Source Speaker Identification for Political Science” and “#2: Astronomy in the Open – Dr. Taylor James Bell on Eureka!” along with a special episode #0 with hosts Arfon and Abby.

Tune in to learn about the latest developments in research software engineering, open science, and how they’re changing the way research is conducted.

New episodes every other Thursday.

Subscribe Now: Apple, Spotify, YouTube, RSS

Call for editors

Once again, we’re looking to grow our editorial team at JOSS!

Since our launch in May 2016, our existing editorial team has handled nearly 3000 submissions (2182 published at the time of writing, 265 under review) and the demand from the community continues to be strong. JOSS now consistently publishes a little over one paper per day, and we see no sign of this demand dropping.

New editors at JOSS are asked to make a minimum 1-year commitment, with additional years possible by mutual consent. As some of our existing editorial team are reaching the end of their term with JOSS, the time is right to bring on another cohort of editors.

Background on JOSS

If you think you might be interested, take a look at our editorial guide, which describes the editorial workflow at JOSS, and also some of the reviews for recently accepted papers. Between these two, you should be able to get a good overview of what editing for JOSS looks like.

Further background about JOSS can be found in our PeerJ CS paper, which summarizes our first year, and our Editor-in-Chief’s original blog post, which announced the journal and describes some of our core motivations for starting the journal.

More recently, we’ve also written in detail about the costs associated with running JOSS and about scaling our editorial processes, and we’ve talked about the collaborative peer review that JOSS promotes.

How to apply

Firstly, we especially welcome applications from prospective editors who will contribute to the diversity (e.g., ethnic, gender, disciplinary, and geographical) of our board.

✨✨✨ If you’re interested in applying please fill in this short form by 6th November 2023. ✨✨✨

Who can apply

Applicants should have significant experience in open source software and software engineering, as well as expertise in at least one subject/disciplinary area. Prior experience with peer-review and open science practices are also beneficial.

We are seeking new editors across all research disciplines. The JOSS editorial team has a diverse background and there is no requirement for JOSS editors to be working in academia.

Selection process

The JOSS editorial team will review the applications and make their recommendations. Highly-ranked candidates will then have a short (~30 minute) phone call/video conference interview with the editor(s)-in-chief. Successful candidates will then join the JOSS editorial team for a probationary period of 3 months before becoming full members of the editorial team. You will get an onboarding “buddy” from the experienced editors to help you out during that time.

Dan Foreman-Mackey, Olivia Guest, Daniel S. Katz, Kevin M. Moerman, Kyle E. Niemeyer, Arfon M. Smith, Kristen Thyng