Journal of Open Source Software Blog

8 Feb, 2024
josscast

JOSSCast #3: Studying Superbugs – Juliette Hayer on Baargin

Subscribe Now: Apple, Spotify, YouTube, RSS

Juliette Hayer joins Arfon and Abby to discuss Baargin, an open source tool she created to analyze bacterial genomes, especially those resistant to antibiotics.

Juliette is a PhD Researcher at the French Research Institute for Sustainable Development (IRD, Institut de Recherche pour le Développement), at the MIVEGEC research unit, where she implements computational biology methods for bacterial genomics and metagenomics to understand the circulation and transmission of antimicrobial resistance.

You can find Juliette on GitHub (@jhayer), ResearchGate, and X (@juliette_hayer).

Episode Highlights

[00:02:21] Introduction to Baargin: Juliette explains that Baargin stands for Bacterial Assembly and Antimicrobial Resistance Genes Detection in Nextflow. She developed it to analyze the genomes of drug-resistant bacteria in various environments.
[00:06:20] Multiplex Sequencing: Juliette discusses the challenge of assembling genomes for multiple strains simultaneously using high-throughput sequencing technologies.
[00:07:21] Next-Gen Sequencing and Assembly: The conversation delves into next-generation sequencing, the assembly of short reads, and the emergence of long-read technologies for comprehensive genome analysis.
[00:09:59] Target Audience: Juliette identifies microbiologists as the primary audience for Baargin, emphasizing its user-friendliness for researchers producing genome data.
[00:12:50] Nextflow in Bioinformatics: Juliette explains the role of Nextflow in bioinformatics and its popularity, highlighting its benefits for scalable and reproducible workflows.
[00:17:03] Open Source Philosophy: Juliette shares her commitment to open source principles, advocating for transparency, reproducibility, and collaborative contributions in research.
[00:19:20] Research Using Baargin: Juliette discusses her published studies, including the identification of drug-resistant E. coli transmission in Chile and ongoing projects in Vietnam and Cambodia.
[00:20:14] Publishing in JOSS: Juliette describes the benefits of publishing in the Journal of Open Source Software (JOSS), emphasizing the focus on code and transparent review processes.
[00:23:27] Documentation Importance: The hosts discuss the significance of documentation in software development, with Juliette highlighting its critical role in ensuring usability.
[00:26:03] Contributions and Skills: Juliette welcomes contributions to Baargin, mentioning that comfort with git and Nextflow is essential for potential contributors.
[00:28:27] Future Roadmap: Juliette outlines plans for extending Baargin, including adding tools for predicting resistance genes, improving detection of mobile genetic elements, and enhancing multi-locus sequence typing.

Transcript

[00:00:05] Arfon Smith: Welcome to Open Source for Researchers, a podcast showcasing open source software built by researchers for researchers. My name’s Arfon.

[00:00:12] Abby Cabunoc Mayes: And I’m Abby.

[00:00:13] Arfon Smith: And we’re your hosts. The way this works is that every other week we interview an author published in the Journal of Open Source Software and talk about their work. Today we talked with Juliette about Baargin, this software that’s responsible for assembling genomes for these bacteria that have resistance to drugs, antibiotics.

When I choose to worry about the world, this is the thing I worry about. The fact that we have these bacteria that are increasingly becoming resistant to antibiotics. It seems like a really important piece of software for potentially the future of humanity.

[expand]

[00:00:46] Abby Cabunoc Mayes: Yeah, no, I definitely agree. Whenever I’m taking antibiotics, I get worried. It’s like, do I really need this? . I don’t help more superbugs come around.

Yeah.

[00:00:54] Arfon Smith: for sure . And also, I had a year back in 2007 to 2008 working in Bioinformatics Institute. I think you’ve spent time in bioinformatics too.

[00:01:04] Abby Cabunoc Mayes: Yeah, yeah, I actually have a degree in bioinformatics and then I joined bioinformatics lab. I worked there for about five years. So I probably should have known a bit more about bioinformatics when we were talking with Juliette. I also realized I’m still calling it next gen sequencing.

It’s been a decade. Is this still the next gen? Who knows?

[00:01:21] Arfon Smith: yeah, yes, but we’ll forgive you. I think fine. Yeah, it was called NGS, right? I think as I remember it as well.

[00:01:29] Abby Cabunoc Mayes: Now it’s high throughput sequencing, is what she said. probably the more modern way to talk about it.

[00:01:36] Arfon Smith: So the sort of TLDR for today’s episode is, existential risk to humanity –pay attention , listen to two out of date people talking about bioinformatics, and updating their working vocabulary on the topic.

[00:01:48] Abby Cabunoc Mayes: It was great hearing her experiences both in JOSS, but also just creating these workflows for other bioinformaticians to use.

[00:01:54] Arfon Smith: For sure. Shall we jump into the conversation?

[00:01:57] Abby Cabunoc Mayes: Let’s do it.

[00:01:58] Arfon Smith: This is episode 3 and we’re talking with Juliette Hayer about their paper Baargin and Nextflow Workflow for the Automatic Analysis of Bacterial Genomics Data with a Focus on Antimicrobial Resistance.

That is a long sentence but it is a great paper so we’re going to talk about it. Juliette is a PhD researcher at the French Research Institute for Sustainable Development, and welcome to the podcast, Juliette.

[00:02:21] Juliette Hayer: Thank you very much for Inviting me.

[00:02:24] Abby Cabunoc Mayes: Of course. Just to dive right in, so I know before we started recording, you told us how to pronounce Baargin. Can you tell us what that stands for? I know it’s an acronym, and maybe a bit about why you started it?

[00:02:36] Juliette Hayer: Yeah. So the acronym to start with is

[00:02:39] Abby Cabunoc Mayes: Editing Abby here. So we lose Juliette’s audio for just a split second and she’s explaining the acronym. So Baargin B A A R G I N stands for Bacterial Assembly and Antimicrobial Resistance Genes detection in Nextflow. I’ll let Juliette continue to explain why she started the project.

[00:02:56] Juliette Hayer: But the way I started is that in the framework of my research program at IRD, so at the MIVEGEC research unit where I work, part of my work is to implement bioinformatics tools to investigate the circulation of antimicrobial resistance bacteria and antimicrobial resistance genes between human, animal and the environment.

I collaborate a lot with other researchers from Southeast Asia, Africa, and also South America. And so in this context, I developed Baargin.

You may know some bacteria that you can actually find everywhere, like the Enterobacteria. Some are very famous, like Escherichia coli E. coli also called. They can be found in any kind of environment and some of them can become pathogen. For human or for animals. And some of them can also become highly resistant to drugs through antibiotics. So that’s what we study. And, it is actually also the overuse and the misuse of the antibiotics that’s leading to this high amount of resistance among the bacteria.

And, it’s maybe worth, noting that WHO has said it’s a major problem for public health this, antimicrobial resistance. So in some of the research that we do, with the collaborator in Southeast Asia, Africa, and South America. We use genomics and metagenomics approaches to investigate the circulation of these bacteria and their resistance genes.

So the genome is all the genetic material of the bug, of the bacteria, and that’s what we study, and we try to sequence with high throughput sequencing technologies. And within this kind of project, we are producing quite a large amount of data that needs to be analyzed, and we need to compare these different bacterial strains, and that’s why I started with Baargin. I wanted to develop a bioinformatics workflow that would be easy to use, flexible, and highly scalable to be able to analyze hundreds of different strains at the same time. So that’s how I started.

[00:05:14] Abby Cabunoc Mayes: So when you’re actually doing the sequencing, you’re putting hundreds of strains all at once that get sequenced all together. Is that correct?

[00:05:21] Juliette Hayer: exactly.

[00:05:21] Abby Cabunoc Mayes: Yeah, that’s very different than what I’ve seen before. I haven’t done bioinformatics in a long time, but usually it’s one specimen, yeah.

[00:05:31] Juliette Hayer: exactly. So now you can multiplex, we say, put several strains at the same time on the same flow cell for sequencing and you get a very high throughput of data. So then you, of course, have to split what belongs to who , but you can sequence many at the same time.

And currently in the project. One project in Cambodia that I’m working with, we have about 700 genomes of bacterial strains, mainly Enterobacteria. So that’s also why we needed something like this.

[00:06:04] Abby Cabunoc Mayes: Yeah, that’s really interesting because even just with regular high throughput sequencing, it’s a challenge to put the genome back together again. The genome assembly is still tricky, but here it’s just another layer where you’re putting together multiple different genomes.

[00:06:19] Juliette Hayer: So you do it in parallel, yes. The genome assembly is the step that takes the most computing resources of course, but we have now very nice tools that have been developed for doing that and that are quite efficient and do not use that much as they did before.

The one I have included in Baargin is called SPAdes. It’s one of the most famous for assembling bacterial genomes. It’s very powerful, but, the thing with Nextflow, the workflow manager that I’ve used to develop Baargin is that it can parallelize the job for like hundreds of strains at the same time.

So that’s what’s cool. Depends also on your hardware as well.

[00:06:59] Abby Cabunoc Mayes: Yeah, and just for anyone listening who’s unfamiliar with next gen sequencing, and you can correct me if I’m wrong, Juliette, but there, they, instead of just reading a genome one letter at a time, it like splits it all up into tiny pieces, sequences them all, and then tries to stitch them all back together.

So it’s a fun computational challenge, I think. but yeah.

[00:07:20] Juliette Hayer: Correct.

[00:07:21] Arfon Smith: So, both Abby and I, I think, have been in past lives worked at Bioinformatics Institute. So I actually worked for a year, at the Wellcome Trust Sanger Institute in Cambridge, which was the, one of the places that sequenced the original human genome. So I know a little bit about what were called next generation sequences back in 2008 or something, I guess.

I was curious, what’s the hardware that you’re using here? Is it those very short read sequences that are being assembled together? Or is it, I know there’s some like nanopore stuff that was sort of magical future looking stuff. Like, what’s the actual tech that’s running under the hood here?

[00:07:59] Juliette Hayer: Yeah. So we actually use both. We use basically the short reads technology, which is now, the market is led by Illumina still. The length of the sequences that are produced is usually 150 base pair. So you get two sequences for the same basic sequence, and then you get a lot of them. So that’s very high throughput, which is very, very nice.

But then you need to get the assembler reconstruct the longer pieces, which we call contigs. the longer sequences. And sometimes it can be difficult. Most of the time, you cannot get the full chromosome of the bacteria in one go with this. So now they develop the long read technologies like, PacBio and Oxford Nanopore technologies.

And we also use that because that is very amazing to get the full structure of the chromosome or of the plasmids as well, which are the small circular pieces that also run into bacteria and usually carry a lot of resistance genes. So it’s also nice to get their structure fully. And the best of the best is to combine both.

So we get the very high quality because of the high throughput of the short that we can map on the long reads. So align, and then you get a perfect resolution of the genomes. And Baargin can take both, either only short reads, or it can take short and long reads as an input, or the already assembled contigs if you wanted to assemble it by yourself before and just go with the rest.

[00:09:38] Arfon Smith: That sounds really powerful. It sounds like you’ve thought about all these different technologies and made it very flexible for different scenarios. Who’s your sort of proto typical user? Is it a researcher or would you expect a sort of analyst or an engineer to run this code?

Who do you find uses your software most?

[00:09:56] Juliette Hayer: So my very first audience are my colleagues in my research group and also my collaborators abroad. Some of them really needed something like that. Of course, because it’s in Nextflow, you have to run it with the terminal, so you have to have some basics in Unix, just to get it to run. But yeah, the main audience is microbiologists because I think many people nowadays, researcher or engineers or lab technician, they will produce genome data for their strains because now it’s something that you do basically every day that, okay, you get a new strain. You will sequence it.

It can be a researcher or other people, but definitely working with microbiology. But yeah, I would advise to have some skills in UNIX. I didn’t make a user interface yet, easy click button thing. But that maybe should be in the plan,

[00:10:54] Arfon Smith: That’s okay, I think there’s a whole collection of tools out there that don’t have like a nice GUI. The interface is the terminal.

I was actually curious, say I buy myself a sequencer or I wanted to use this tool, are people typically running on big clusters? Is there lots of compute under the hood? Like, how big are the jobs? It probably depends on how much data you have. But, typically what sort of hardware would people be running this tool on?

[00:11:19] Juliette Hayer: So I think basically if you have just a nanopore, plug to your laptop and you just sequence maybe, let’s say, four or five strains of enterobacteria that have a genome that is like maybe five megabases, something like that. You could run it on a laptop. I think that would be fine. If the assembly can go on, then you’re fine.

Of course, you can customize the databases that are used for detecting the antimicrobial resistance genes and the plasmids, and also to make the annotation of the genome after all. If you have a lot of space, if you have an HPC in your lab, it’s better if you want to run a lot of strains at the same time, and also to install the larger databases that also provide more power of prediction.

I made it so it could be installed just in a laptop with minimal databases, just to get some results first.

[00:12:19] Abby Cabunoc Mayes: Yeah, so if I get my hands on a Nanopore, and I really want to know what bacteria is growing in my bathroom, I could maybe, if it’s not too big, potentially sequence it. And run it through Baargin

[00:12:28] Arfon Smith: , Let us know if you take that project on as a side hustle. Abby,

[00:12:32] Abby Cabunoc Mayes: I will, yeah.

[00:12:33] Arfon Smith: the results. Laughter.

[00:12:34] Abby Cabunoc Mayes: So you talked a little bit about Nextflow. Can you tell us a bit about its role in bioinformatics? I think Nextflow is more recent than what either Arfon or I have used. So, it would be great to hear, what it does.

[00:12:46] Juliette Hayer: Nextflow is a great, workflow manager. Of course, there are others that exist, so you have different schools with different kind of people. I know there is also SnakeMake. I haven’t used it that much, even though I really like coding in Python, and SnakeMake is based on Python.

But Nextflow, I think they started around 2015 maybe. I hope I’m not wrong, I haven’t checked that before. But I started using it in, 2017 or 18 when I was then at that time in Sweden and I enjoyed really a workshop that I went to in Barcelona where I started learning about Nextflow and how to use it and how to run pipelines and then I met first people of the NF Core community.

I don’t know if you have heard about this, so that’s how Nextflow became very big in bioinformatics, because a few people that started using Nextflow for building bioinformatics pipeline, they met together and they started a real community and they started to put some guidelines on how you should develop a Nextflow pipeline for bioinformatics.

So I think they are huge nowadays because they have put out many pipelines, and many that are really highly used by people in bioinformatics.

[00:14:10] Arfon Smith: So Nextflow sounds like it’s super popular. Yeah, I was also familiar with Snake, SnakeMaker as well. I’ve not used either. I know they’re both popular and there are other tools. Again, I’ve like aware of like Galaxy workflows. I think that’s another tool.

[00:14:23] Juliette Hayer: That’s different. Yeah,

[00:14:24] Arfon Smith: that’s another one again.

There are a bunch of reasons to use a workflow management tool, but one of them is just reproducibility and the ability to reliably execute a set of tools those tools. It makes tons of sense. I was curious though, that workflow is executing Baargin at some point, it’s gonna be a step in the process.

What other tools before this one were available for these kind of tasks? Did you find yourself unhappy with what was already available? What else people might use if they’re not using the software you’ve created?

[00:14:54] Juliette Hayer: So, maybe just one point about Nextflow again that I’m not sure I have mentioned, but it’s based on the language Groovy, which is itself based on Java, and I haven’t that, so that’s that’s worth it. For the other tools that can do several things as what Baargin does, yes of course they exist because many people work on this kind of questions.

One that comes to my mind seem to be really great is Bactopia. I think this is also made in Nextflow but it’s huge, it’s very complex, it’s , in my opinion, quite not easy to install for any user. So, for bioinformaticians and people that are skilled and that know what they are looking for, in the bacterial genome already and which sub workflow they know what they want to do, use, then I would definitely go for Bactopia.

For me, I wanted really something more simple than this and that really had a focus on detecting the plasmid features and the antimicrobial resistance combining different tools that can predict resistance genes.

And there are also other great tools that exist, but can be specific only for one bacterial species. Like I know about Kleborate, which is a pipeline, also very nice, but only for Klebsiella species, which is a kind of bacteria.

So that’s how I ended up coding Baargin. Also to distribute it to our collaborators that not necessarily have the the computing power that is required for installing very large workflows with a lot of databases and all.

So I wanted something lightweight as well.

[00:16:36] Abby Cabunoc Mayes: It was interesting hearing you talk about these, workflow systems, especially Nextflow and things like Galaxy, and how they’ve built a community around that through best practices and these conferences and stuff.

I know you’ve made it primarily for your collaborators, but, do you think there’s room for this to be more widely used with others? Why did you make it open source to begin with, I think was the real question that I was asking around, but go ahead.

[00:17:00] Juliette Hayer: Yeah, so why did I make it open source? Because I’m a researcher and I’m working with collaborators for from other academic institutions all over the world and I think everything we do should be open source. That’s really something I believe and in France in the academic institutions we prone a lot for open source and even more at my institution at IRD. So that is really something that I could not imagine differently.

Also for reproducibility and for all the people to be able to contribute, that’s also important. And to know what they’re doing when they are running that, they can go into the code and it’s not a black box.

So I think for me, that’s very important.

[00:17:46] Abby Cabunoc Mayes: So Juliette, have you run many studies using Baargin yet?

Any interesting insights you want to share?

[00:17:51] Juliette Hayer: So, me and my collaborators and students are using Baarging a lot. I have one study published where we collaborated with Chilean colleagues, where we identified the transmission of super resistant E. coli between wild animals and livestock and companion animal in farms within and between farms in central Chile.

So that was one study. And then we have other studies ongoing in Vietnam, in different hospitals where we isolated Klebsiella strains where that also are resistant to carbapenem and colistin antibiotics. So that’s also interesting. One of my students is working on that. And we have an ongoing project with Institut Pasteur in Cambodia and Battambang Hospital in Cambodia, where we have a lot of bacterial genomes.

That’s the one I talked about before, about 700 different bacteria that we collected from patients to start with, that came to the hospital with resistant infection and we went to their household to collect also bacterial strains from their environment, food, and animals that they have.

We have not published that yet, but we are studying, analyzing the results at the moment. So that will be interesting results that will come soon, hopefully.

[00:19:16] Abby Cabunoc Mayes: Yeah, it sounds like actual useful information that will help people in the world. So, that’s great

[00:19:22] Juliette Hayer: yes

[00:19:23] Abby Cabunoc Mayes: cool, so we’ll link the published studies in the show notes.

[00:19:25] Juliette Hayer: Yeah, it’s really to try to understand how the resistance circulates between the humans and the animals and their environment, like what do they carry, which resistance genes they carry and how do they share it. But yes, I will send you at least the publication that is already published and will keep you updated for the next steps.

[00:19:49] Abby Cabunoc Mayes: Yeah, that’s great. Yeah, we’ll include that in the show notes for sure.

[00:19:51] Arfon Smith: If we could just switch tracks for a second and just talk about JOSS and the fact that you published there. Looking at the paper, I think it was back in October 2023 when the paper was finally published. I was curious if you could just say a bit more about, why you published in JOSS and tell us a little bit about that experience.

[00:20:09] Juliette Hayer: So the first idea came from one of my co authors, Jacques Dainat, who had already published in JOSS, and he told me about how amazing was the experience, and how we should always do that when we publish bioinformatics workflows or tools, because everything is centered on the actual tool or pipeline.

And that is central. And also the fact that the code is revised as well, is reviewed by other people that also know about what you’re doing and probably are in the same field, because you can choose the reviewers. You have a list of potential reviewers in JOSS that you can select and you have an idea of what they are working with.

And of course, you can visit their GitHub, so that also helps. So that was one great thing, and also everything is open during the review process and transparent. Anyone can see what’s happening with the code, and you can get very nice input from the reviewers to also improve your code and the documentation, which is very important.

If I had not published in JOSS, maybe I wouldn’t have put that much effort in writing a well structured documentation with examples and tests and all these things that are actually very good for a workflow, for bioinformatics workflow, because I have seen a lot of other workflows that could be published in more standard life science journals.

And then, what? Usually they are almost impossible to install or there is no documentation because they didn’t need to do that to publish it. So here, yeah, I think it’s good that the code and the documentation is at the center of the whole thing. And also, discussing with the reviewers over GitHub, a tool that we use on a daily basis, that also makes things very convenient.

[00:22:08] Arfon Smith: Sounds like you had a good review experience. I just had a quick look over the review again, before this conversation and it looked like it was a super productive conversation and lots of good feedback. I think you’re right. One of the key areas where I think authors get most value is the review of documentation. It is so valuable to have somebody know nothing about your tool and just, start with a fresh blank slate, new machine, new environment.

There’s probably another name, which is like Journal of Open Source Documentation or something. It’s the software of course, but usability often begins with great docs and clearly defined dependencies. But it’s just so hard to be objective as the author of software when thinking about what would somebody need to know and the sort of undocumented steps. If we had to look where most changes get made as a result of a JOSS review, my guess would be docs every time.

I think it’s the most common area of change and just reinforces that software is not just about the code executing on the machine, but it’s all the bits around the side that are so important for the humans as well who are going to be operating that software

[00:23:21] Juliette Hayer: Yeah,

[00:23:22] Abby Cabunoc Mayes: One thing I do like about that process is just how it pushes you to make it more usable for a broader group. I know we talked about, like, are you building that open source ecosystem? This is one step towards that, having that good documentation so others can

[00:23:34] Juliette Hayer: That’s true,

[00:23:35] Abby Cabunoc Mayes: jump in and use it.

Is there anything else you learned going through this review process or you’re grateful for?

[00:23:40] Juliette Hayer: Yes, they really insisted on running test, so it’s not easy to make unique testing and continuous integration when you work with workflows, so I’m very open to suggestions for that and to contribution if people want to help with this, because it’s not an easy and trivial task. They gave some hints and some input regarding things that I should try .

I haven’t completely done it, but I have made a full test that can test where you can test all the mandatory steps of the pipeline, but it doesn’t test process by process. So that was useful to really make me think about the things that I should improve from that point of view also.

[00:24:20] Arfon Smith: Testing something like a workflow tool is probably pretty hard, right? There are some things where we have this challenge with JOSS where we, as part of the review process, as you probably remember, the actual language is a little unusual. We say reviewers must be able to objectively test the functionality or verify the functionality of the software.

So we don’t say, you must have a million unit tests and 99 percent test coverage. What we’re trying to get to is: You need to be able to verify that this thing works. And that seems like a reasonable thing. But, there are a number of times when that actually can be quite hard. You know, when it’s a complex system, maybe it’s running on a cluster.

It’s really hard for a user to verify that. Custom hardware often makes it hard for people where, you know, you need a particular variant of a GPU or something. Another one is actually just complex user interfaces. If it’s just a command line tool with standard inputs and outputs, that can be quite easy to write normal tests.

If it’s a really complex set of interactions the user has to do, the testing can be a huge amount of the work, actually. So, yeah, I hope we find the balance there.

[00:25:33] Juliette Hayer: But I think that’s something very nice with JOSS is that at least it forced you to have a test that is running on the machine of other people, which not everyone can pretend to have when they publish.

[00:25:47] Arfon Smith: It’s true. It’s true. Absolutely. Yep,

[00:25:49] Abby Cabunoc Mayes: So you mentioned that you’re open to testing contributions. So if people do want to contribute, what sort of skills do they need? What languages? What sort of background are you looking for?

[00:25:58] Juliette Hayer: I think they need to be quite comfortable or confident using git and Nextflow. That’s the basic, because the workflow is coded with system DSL 2. So I have all the steps as modules. All the processes are in separated modules that can be reused in other workflows as well. So it’s quite easy if you can code in the Nextflow to add a new module and then, clone the repo and change the main script accordingly to add the step you want to be added in the middle, in the full workflow. So people that also want to modify a part of it or add a step that I did not necessarily have in Baargin already, they can do that, they can clone or fork or suggest me to add If it’s something that should be added, of course.

[00:26:51] Arfon Smith: It sounds like contributions are welcome, which is great. I was going to say, are there obvious things that you’re personally interested in extending the software with at this point? Or is it mostly sort of done for your needs and your collaborators needs?

[00:27:05] Juliette Hayer: For the needs we have now, it’s mostly done, but no, some tools, I’m already thinking about changing them or adding other options. Because I really like combining different methods for doing the same thing. And so I have at the moment two different tools for predicting the resistance genes, as I said before.

I would be happy to add new ones that are coming. But then I would need to work on harmonizing the results. Which is not necessarily easy because they use different databases, and usually genes in the databases, they can have different names and all, so that’s an all different story. But that would be fun, I think.

For the multi locus sequence typing I want to add, now I have a very basic tool that’s working on seven genes for typing the bacteria. And I want to add another tool that would use much more genes to base the typing on. So that’s the first thing I want to do.

And then there are others, of course. I would like to improve all the detection of the mobile genetic elements. That’s something I really need to work with. Investigate other tools that are to be included.

[00:28:22] Arfon Smith: Sounds like you’ve got quite a roadmap ahead of you, actually.

[00:28:24] Juliette Hayer: Yeah.

[00:28:25] Abby Cabunoc Mayes: yeah, and it also sounds a bit like the kind of contributions that would be most helpful is Other users that have their own use case and they want to add different parts of the workflow and change up a little. Is that, yeah. Cool. Awesome. So if you’re listening and you want to use Baargin, Juliette here welcomes your contribution.

[00:28:43] Arfon Smith: Sounds like it.

[00:28:44] Abby Cabunoc Mayes: Just to close us off, how can people find you online and keep up to date with your work?

[00:28:48] Juliette Hayer: So they can find me on GitHub, of course, under @jhayer , where Baargin is released, and you can find me on Researchgate and on X, and it’s Juliette underscore Hayer.

[00:29:01] Abby Cabunoc Mayes: Perfect.

[00:29:01] Arfon Smith: Awesome.

Yeah. Well, Juliette, thank you so much for coming and being part of the JOSSCast. It’s been great to talk to you today about the Software, Baargin, I’m still not saying it right, I’m

[00:29:13] Juliette Hayer: It’s okay.

[00:29:14] Arfon Smith: it’s okay, I’m getting better. Okay, Baargin and the problems it solves, it sounds incredibly relevant, I know, I worry about anti bacterial resistance. I think that’s a really important thing for humanity to be working on. I’m grateful for the work that you and your team are doing, and for publishing in JOSS.

[00:29:33] Juliette Hayer: Thank you very much for having me.

[00:29:41] Abby Cabunoc Mayes: Thank you so much for listening to Open Source for Researchers. This is our first episode that we’ve released since our launch. It’s been amazing to see the response. Thank you so much for subscribing for telling your friends for sharing on social media. We love to showcase open-source software built by and for researchers. So subscribe to hear more in your favorite podcast app.

Open Source for Researchers is produced and hosted by Arfon Smith and me, Abby Cabunoc Mayes, edited by Abby and music is CC-BY Boxcat Games.

[/expand]

Video

25 Jan, 2024
josscast

JOSSCast #2: Astronomy in the Open – Dr. Taylor James Bell on Eureka!

Subscribe Now: Apple, Spotify, YouTube, RSS

In this episode of Open Source for Researchers hosts Abby and Arfon explore the world of open source software in astronomy with Dr. Taylor James Bell, a BAER Institute postdoc at NASA Ames. Eureka! is an end-to-end pipeline designed for JWST (James Webb Space Telescope) time series observations. We chat about the motivations behind Eureka!, its unique features, and the democratization of exoplanet science.

Join us for an engaging conversation that unveils the complexity of time series observations, the power of open source, and the exciting future of Eureka! and JWST discoveries.

Transcript

[00:00:04] Abby: Welcome to Open Source for Researchers, a podcast showcasing open source software built by researchers for researchers. My name’s Abby

[00:00:11] Arfon: and I’m Arfon.

[00:00:13] Abby: and we’re your hosts. Every other week we interview an author published in the Journal of Open Source Software, JOSS, and this is episode two . We’re chatting with Dr. Taylor James Bell on his software Eureka! and the paper’s called Eureka!: an end to end pipeline for JWST time series observations.

[00:00:30] Arfon: So this is an exciting episode for me, we get to dig into some of the open source software that’s been built for the James Webb Space Telescope, which is a mission I used to work on in a previous life. This is a groundbreaking scientific observatory that’s in space right now and doing great science and it was really interesting to chat with Taylor about the work he’s doing with the team.

[00:00:49] Abby: Yeah, and even for me as not an astronomer who didn’t quite understand everything about exoplanets, I still really enjoyed just his experience trying to build the software in a way that is community focused and not just publishing a paper and having it go out

[00:01:05] Arfon: Absolutely. Yeah. Looking forward to jumping in.

[00:01:08] Abby: Let’s do it.. Taylor, do you want to say a little bit about yourself?

[00:01:12] Taylor: Yeah, I’m a BAER Institute postdoc here at NASA Ames in Mountain View, California. I work on studying the atmospheres of exoplanets, so planets orbiting distant stars. And to do that, I do a lot of data analysis and recently open source software.

[expand]

[00:01:29] Abby: That’s amazing.

Just to kick us off, why did you start this project, this piece of software?

[00:01:34] Taylor: it was initially started by two of my colleagues, Kevin Stevenson and Laura Kreidberg, two faculty in astronomy. and they’re both well respected experts in the analysis of Spitzer and Hubble space telescope data.

But we wanted to prepare for the JWST launch, which was coming up in about a year from then. So the two of them have written software for Hubble and Spitzer and wanted to get ready. And they wanted to do something different this time. Their code for Hubble and Spitzer, sometimes it was open source, sometimes it wasn’t, but it was never community built, community driven and really truly open source. Not just open source, but like well documented and user guides and documentation and all of these things. So they reached out to a broad group of experts in the fields to get involved and make it a large team project.

And so about a dozen of us joined the team. and just naturally through those collaborations, I rose to a kind of primary contributor level in the team.

[00:02:39] Abby: That’s awesome. and I should have mentioned before, we are talking about Eureka!.

[00:02:42] Arfon: Yeah, I was going to say this is really cool for me to hear about this piece of software, Taylor.

I used to work at Space Telescope Science Institute in Baltimore, which is where the James Webb Space Telescope is operated from. And so it’s really nice to see such a rich sort of ecosystem of open source tools that are being developed around the Webb Telescope. That seems to be turning out to be, a defining feature of this mission as compared to earlier missions.

I guess, actually other space missions have been good at this. I know Kepler was very strong, especially in the K2 phase of the mission was really strong on open source and community software as well.

[00:03:15] Abby: Yeah, so related that do you have a strong background in open source?

Or is this your first time doing something open source?

[00:03:21] Taylor: So my undergrad was in physics, but I did a minor in computer science. I took about a half a dozen classes or more. Taking, a lot of programming principles and practices and stuff like that. So I got some experience. In the kind of 2018 to 2020, I wrote two open source packages.

One of them falls nicely into the open source and well documented and, all of those things. That was a simulation tool, a pretty basic little tool to model what an atmosphere might look like. Like what an observation of a planetary atmosphere might look like. And then I also built a pipeline for Spitzer. I co developed that.

It was open source, but again, not in that regime of actually, broadlly usable by people.

[00:04:08] Arfon: I was going to ask about who your target audience is for this software. So I know a little bit about what Webb is doing and looking at the atmosphere of exoplanets. Maybe I’d be, I think, helpful to just explain a little bit about what the telescope’s doing, what type of observations it’s making, and who cares about those kind of observations? Would you mind sharing a little bit of just background. Who, uses this and why, that’s interesting?

[00:04:33] Taylor: For sure, yeah. Eureka! targets time series observations. So these are observations that, , monitor something over a length of time.

And so we’re monitoring changes in brightness. The primary use case that we’ve developed it for, which is all of our developers primary science use, is to observe exoplanets. So we look at transits, eclipses, things where we’re looking for a change in brightness of the combined system of fractions of a percent.

And we’re looking for very small features. A lot of existing institutional code is not excellent for time series observations because time series observations are very different from just taking a single snapshot of a galaxy. You have to treat the data in very different ways when you’re treating single snapshots. You really want to know how bright is every individual source absolutely. How many exact number of photons are coming from that. Whereas for time series, you want to know very precisely how that number is changing. You don’t care what the constant value is. You just care how it changes. and yeah, there’s a lot of differences there.

We built this pipeline for researchers. We’ve built it in no small part for our own use. and that’s been part of our big success is that all of us developers are very actively using the code. But really one of the driving motivations behind it was that programming ability shouldn’t define scientific success.

[00:05:59] Taylor: And so there’s a lot of, especially early career researchers and people all over the globe that want to participate in exoplanet science, but don’t have the programming ability to do to build a complete pipeline from scratch that is independently verified and things like that.

It’s a huge feat and it was something that the community struggled a lot with Hubble and Spitzer. And yeah, our target demographic really ranges from undergrad researchers working for professors and the professor gives them a single observation all the way to faculty and postdoc who just want to quickly crank something through.

[00:06:35] Abby: I really resonated with what you said about how the ability to code shouldn’t determine your ability to do research and how this open source code is really democratizing science in a way that we haven’t seen in a while.

So that’s exciting.

[00:06:46] Arfon: Yeah, I was going to ask, it’s related on that. How low is that barrier? I know a little bit about where the JWST data ends up. A lot of it is in the MAST archive at Space Telescope and I think you can get it in the European archives as well.

Can I just go download some data and run this tool? How turnkey is it for, an end user?

[00:07:04] Taylor: That’s something we’ve been struggling with in two mutually opposing ways. One is, yeah, a lot of JWST data, there’s an initial proprietary period of about 12 months or something, but then eventually it becomes public and so anyone can look at it.

Then, there is some amount of analyzed data that is archived along with those raw data, and that’s not particularly good quality at this point in time. STScI, the Space Telescope Science Institute, is working on improving that data quality. But in the meantime, it’s in a fairly rough state for time series observations.

And then for actually using Eureka! to get science data out. We’re fighting kind of two opposing directions. One is we want it to be really easy and accessible for everyone to use. No programming, obviously, but minimal barriers. And so we’re working on, making better documentation, making better templates.

So that basically you just need to specify where your inputs and outputs are and you can get a half decent first result. But something that we have learned with Hubble and Spitzer is that what works well for one observation almost never works perfectly for another observation. It might work okay, but that doesn’t mean your results are going to be ideal or even necessarily exactly right.

And so there’s some amount of fine tuned fiddling that needs to be done, and some amount of intuition that needs to be done of what do you check to make sure that what you’re getting out is reasonable. And so we don’t want people to think of Eureka! as a black box, that they just put something in, they get something out, and they’re like, okay, fine. I’ll write my paper on that.

And so one of the things that we’re doing is we’re putting out a lot of log information, but trying not to overwhelm people, just give them some heuristics of what’s going well but also plot, plot all kinds of things because humans are pretty visual creatures and then being able to just click on all the things and gain some intuition for oh, if I change this parameter, this thing changes in some desirable or undesirable way.

And yeah, just really trying to graphically demonstrate things to people.

[00:09:20] Arfon: Awesome. That’s really cool. It sounds like. I probably could, as long as the data is outside the proprietary period, I probably could take the software for a spin, but you’d need to be a little careful, which I guess makes sense. It might work, but you’d want to be careful and really validate the kind of outputs you were getting before putting those, results in a paper or anything.

[00:09:40] Taylor: Yeah, the way, I think of it is it’s the default values are great for a quick first look, Do I have an atmosphere? Did I catch a transit? Did the observations work? And then when you want to get to publication readiness, when you have to twist all those little knobs.

[00:09:53] Arfon: I actually have a side question, which is, how do you know when to look? So I think if I understand what’s right, what you’re saying is, you’re looking at an exoplanet. You’re interested in atmospheres, which means you’re looking for an exoplanet as it goes in front of the star that it’s orbiting. JWST is not looking there all the time, right?

Instead doing other things. So how do you know when to make the observation? Is that, a hard thing to figure out? Is that something that Eureka! does as well? Or is that a separate thing?

[00:10:22] Taylor: No, yeah, that’s separate. JWST for, the vast majority of its time is not, at least for exoplanet stuff, it’s not looking for anything new.

It’s, not looking for new planets. It’s looking to characterize well monitored planets so that we know the orbital period so Kepler is a great example. It’s observed a hundred transits of the thing because it has an orbit of a day and so just by staring at it for a month you catch 30 transits and you know exactly when it’s going to transit. And then ,doing some fancy fitting you can try and figure out when it should eclipse based on what you think it’s orbital eccentricity is and things like that.

Yeah, just a lot of Kepler’s laws and using the Kepler space telescope as well as excellent ground based telescopes like loss is.

[00:11:09] Arfon: Cool. Awesome. Thanks for that. So I guess one of the thing I’d be really curious to learn about is what, how does this Eureka! package differ from what’s already available? I think exoplanet science is not brand new. It’s relatively recent. Are there other tools that people use where you’re finding a gap in what was possible?

Can you say a bit more about what people might have done before Eureka! even, or what they’re doing alternatively today if people aren’t using the tool.

[00:11:34] Taylor: Yeah, so there’s numerous alternatives at present for JWST data. Most of them are developed specifically for JWST data. I find that is a very good thing, especially early on in telescope’s life, having many different people do many different things, and then we’ve worked somewhat on comparing those pipelines at the end, because, yeah, it’s great if you have multiple things, but if that means a hundred different papers get published using a hundred different pipelines, and you don’t know if you can compare planet A to planet B because they’re different so it’s great to have multiple pipelines but also need to compare them.

As for how Eureka! differs to them many pipelines are focused on just dealing with a single instrument on JWST. JWST has four primary science instruments. You have NIRCam, NIRISS, NIRSpec, and a mid infrared detector, MIRI. and within each of them, you also have multiple different filters and different observing modes.

You have spectroscopic data, you have photometric data, where you’re just taking images instead of spectra. There’s really a diverse amount of possibilities. and exoplanets uses basically all of them. Not all of the modes, obviously, but an enormous amount, and all four instruments. And at present, we’re the, one of the few that targets all four instruments with all exoplanet specific observing modes.

We’re currently missing the NIRISS instrument but we’ve just got funding just earlier this week to implement NIRISS code in there. That’s one big differentiation. Another is that it is community developed and community supported. There’s, about a dozen of us who have contributed large parts, but we get almost weekly, if not more often issues submitted through GitHub, which , allows us to help support people.

A lot of pipelines that just exist, yes, they might be hosted on GitHub, but it’s hard to get anyone to respond. Whereas there’s many of us who try and work with people, help them get up and running. and compared to what existed before JWST, it’s complex with all of these different instruments but in some ways, some of the instruments are quite like Hubble, and some of the data is like Spitzer.

And this, pipeline is kind of Frankenstein bits of Hubble and Spitzer code that was previously written by Laura, Kevin, and I as well as some other people. And so it’s drawing on all of that together. and at present, we allow Hubble and JWST data to be fitted with Eureka!. It’s my personal aspiration that someday we might add Spitzer, but Spitzer is now decommissioned, so it would mostly be for, archival research, which takes a lot of energy to build something new for something that’s just archival research.

There’s also the Space Telescope Science Institute does have its own software to support observers. One of its big weaknesses, in my mind, is lack of plots. There’s no plots that are made. You can save intermediate results, but a lot of people just want to run the pipeline and not, have to interject a bunch of things to make plots or force save every intermediate result and then plot it after.

And so that’s, one big thing that we’re doing is just really making visible all of the intermediate results. And then we also are, we call it an end to end pipeline, where we go from raw pixels to planetary spectra. And in truth, there is slightly further you can go of turning planetary spectra into okay, what’s that atmosphere made out of? Is there carbon? Is there carbon monoxide? Stuff like that.

But we do largely end to end. Stuff which is, again, not something that the STScI pipeline does, and, something that is often done by other community pipelines.

[00:15:44] Abby: That’s amazing. First off, congrats on the funding. That’s always huge to get that, especially in the research space.

I also really resonated with what you were saying about being community built I do think a lot of the power of open source comes from building with the community and with others. So how are you getting the word out to the community? Is this why you published in JOSS, the Journal of Open Source Software?

What else are you doing to bring the community together?

[00:16:05] Taylor: Yeah, it was definitely a large part of why we published in JOSS A), to make sure people heard about us, have a nice, clean way to publicize the pipeline and its state at that point. We, also were quite involved in early release science.

JWST came with early release science observations, which were community proposed observations that would test JWST and in many of the key science cases and collect some open public data for the community to learn on and do a lot of these pipeline comparisons.

And so a lot of, I think every ERS Early release science paper had Eureka! on it. and an advantage of having a dozen developers on it and also just making it open source. Several of the analyses that have been published so far are not by Eureka! developers. We’ve also been working on leading, tutorial sessions and doing workshops at the Sagan Summer Workshop held at Caltech this past summer.

We held two hands on sessions, training people how to go from pixels to light curves, and light curves to planetary spectra. And that was a free conference to attend, either in person or virtually. and so we had hundreds of people participating in that, primarily early researchers, but also some faculty and stuff like that.

And so that Sagan Summer Workshop covered the Eureka! pipeline which takes us from pixels to planetary spectra, and then two different methods of going from planetary spectra to atmospheric compositions. And yeah, just doing a lot of kind of outreach like that and then also just letting our work speak for itself. We’re comparable to a lot of pipelines, we’re easy to use. Encouraging people to check them.

[00:17:58] Abby: Yeah, that’s great. And I do love the workshop method for getting people on board. Are you finding that a lot of these users become contributors eventually or start contributing in small ways?

[00:18:09] Taylor: That hasn’t yet happened. It’s, still very early days.

We haven’t even released version 1. 0. We’re on 0. 10, I think we just released. We, have had community members contributing code. But that workshop was, only a couple months ago. We do get issues submitted by the workshop attendees and, others, which helps. But primarily it’s been us developers solving issues for people at this point and less people contributing their own solutions.

[00:18:37] Abby: Right. I do think that issues are a valid form of contribution to open source. That’s very important.

[00:18:43] Taylor: Good point.

[00:18:43] Arfon: I was actually going to say I mean Abby and I probably spend a lot of our lives doing this. Maybe you do as well, Taylor, but, when you go and look at a project open source project, you can get a sense just by clicking around if stuff’s happening, are there issues being opened and closed, are there pull request being opened and merged. And Eureka! looks great. There seems to be tons of activity on the project, which looks fun and it looks like it’s really a healthy project.

Lots of people opening issues, lots open, but lots have also been closed. It’s always a good sign to see that. yeah, congrats on a really, nice project.

I wanted to ask a little bit about the review process that you went through. So just to explain very briefly, this of course, Taylor, the JOSS review process is an open one.

It happens on a GitHub issue. A couple of people come and go through a checklist driven review process and then ask questions and suggest improvements. I know it was a bit over a year ago, so I know it’s a little while ago, but I wondered if you had any things you wanted to share about that review. Did it help you as authors of the software? What was your kind of lasting experience from that?

[00:19:45] Taylor: Before Eureka!, all pipelines that I’ve written and all pipelines that I can immediately think of that were written were never published in themselves. They were always like, I’ve got these new Spitzer data, so I wrote a pipeline. I tersely write what the pipeline does in the paper, but it really it’s a science paper. So as a result the referees of that paper really don’t focus that much on the pipeline. It’s left to the developer of that pipeline to cross validate it with other pipelines and make sure that they didn’t drop the square root of two anywhere or things like that. With JOSS, it’s still on you to not drop, square roots of two and stuff, but it was much more focused on the pipeline and on the documentation and making sure that there’s, an easy way to install the pipeline and tutorials and things like that.

Which was significantly beneficial to the pipeline. It was great to really have the spotlight on the pipeline itself.

[00:20:50] Arfon: Typically you’d have astronomers reviewing a piece of astronomy pipeline. These people could be users I guess, or proto users. You could imagine review might touch on code quality, documentation, installation. Was there a particular area where you’re like, oh yeah, this is significantly improved some outcome here, or was it just a fairly broad analysis by the reviewers?

[00:21:08] Taylor: I’d say the area in which we most improved as a result of the JOSS review is documentation. It was something we had spent a significant amount of time on, and post facto writing documentation, which is always the wrong way to do things, because it’s so miserable to read other people’s code.

Again, most of this was written for Hubble and Spitzer, many years ago, and so it’s even the people who wrote it don’t necessarily remember what they wrote. I spent a lot of time writing documentation and like trying to make it uniform. In preparation for the JOSS review, and then the JOSS reviewers pointed out even more situations where, it could be improved, which definitely made things a lot better.

I’ll admit we still have a folder of old functions that were written initially in IDL and then converted to Python, but the comments never changed. But we have 30, 000 lines of code and at least at the start we weren’t being paid for developing this, and so it was us donating our time.

Now that we have some money, we can Invest more time in this, but again, 30, 000 lines of code, you can only do so much. We really focused on the user accessible layers of things and a lot of the really deep stuff that’s five, six functions deep is where you start getting a lot of computation.

But the documentation and the the way of installing the pipeline were definitely improved.

[00:22:32] Abby: I guess related to that, this does sound like it’s one of the bigger open source projects and definitely one of the more community focused open source projects you’ve ever contributed to. And I know the JOSS review process is really focused on getting that stability or that groundwork you need to build that flourishing community.

Any big surprises come your way working open source or working openly?

[00:22:51] Taylor: I think the initial surprise that I had was, at how functional we were working with, a dozen people working on the code. There’s so many opportunities for people trying to do the same thing and, overlapping and, then you have nasty merge conflicts and stuff.

But, yeah, at the start we had the core dozen or so of us would have a Zoom meeting every week or two weeks to divide up some tasks. But then I’ve been surprised at how well things have kept running. It hasn’t fizzled out because that’s so easy to happen. A lot of projects start and then people are kind of like eh, good enough.

We definitely have less contributions because the pipeline is getting to a good state. I mean, a better state. But there’s still a lot of interest from many of the members and the community interest is ramping up as time goes and we’re starting to see more and more pull requests which is, yeah, takes the burden off of us too if people, are, writing it like they have an observation of two planets instead of one, and they have to do some special stuff to the code, and then they just contribute that, rather than us having to be like, oh, we have to build something from scratch, if people are doing this, and they do it for themselves, and then people have been very willing to just offer that back to the community in a in- kind donation, which is awesome.

[00:24:19] Abby: I love that. The piece that surprised you is just how well collaboration and open source is working. It’s perfect.

[00:24:25] Arfon: That’s nice. Yeah.

I wonder if I could take us back to the web and science just for a moment. I was curious, like the, software’s existed for a while. I think you mentioned that you built Eureka! to be ready for the early release science program.

As a bit of a space nerd, I was myself, I was curious, are there any results that you’re particularly excited about from JWST? Related to Eureka! or not, are there any things that are really memorable for you and, or, things that you’re anticipating would be, really exciting discoveries that you’re hoping to see in the not too distant future.

[00:24:59] Taylor: Yeah, I think one was all of the early release science observations. We had observations with all four instruments. I was just stunned by the beauty of the data and the remarkable quality of it. I’ve talked a lot about Hubble and Spitzer and both of them, their raw data was kind of ugly for exoplanet science.

There’s just a lot of noise and a lot of dealing with 1 percent systematics to get to 0. 01 percent astrophysical signal. Whereas here we’re dealing with noise that’s at or below the astrophysical signal. And so just the astrophysical stuff just pops up.

One example is the carbon dioxide detection on WASP 39. This was the first paper put out by the ERS team. And that’s just a great example of run it through first time, you didn’t fine tune everything and then boom, you see this feature that we’ve been looking for a long time.

That was super exciting. More recently I just had a paper come out in Nature where we used Eureka! and another pipeline to to find methane in a exoplanet atmosphere. And so this is something that we’ve been looking for a long time in transiting exoplanets. But Spitzer was photometric, and its filters were very broad, and so it averages together a lot of different wavelengths all into a single point.

And so methane would be mixed with water, and carbon monoxide, carbon dioxide, and there’s a whole big mess of things, and so really saying that there’s methane in the planetary atmosphere has been very tough, but with that paper, yeah, you could just see the bump and divot caused by methane, and yeah, that’s been really exciting too.

[00:26:44] Arfon: So this is just the higher resolution spectra is that why you can see it. It’s not convolved with some other, signature,

[00:26:51] Taylor: Higher, resolution spectra and, just higher, signal. Just JWST is so much larger.

[00:26:56] Arfon: Bigger mirror!

[00:26:57] Taylor: Big, bigger mirror, better resolution, and vastly smaller systematic noise components all together make it possible.

[00:27:07] Abby: That’s great. So it does sound like this project is open for contributions. What kind of skills do I need to jump in?

[00:27:14] Taylor: So it’s all written in Python. Yeah, we’ve considered doing some Cython to make some core things faster, but as of right now everything’s Python. So it’s primarily just Python and GitHub.

Just knowing how to use those two. Basically the only things you need to know how to contribute. There’s open source data that you can mess around with and try things out and see what works for you, what doesn’t. We do have a community policy to make sure that your contributions are appreciated and respected and that everyone else is also respected.

Yeah, it’s, basically GitHub, Python and then some time.

[00:27:51] Abby: That’s great. and any any scientific knowledge needed?

[00:27:54] Taylor: We, try to be somewhat pedagogical with some of the stuff. Especially the, first four, Eureka!’s broken into six stages. The first two stages are just really dealing with nitty gritty JWST specific pixel stuff.

And for that we rely on the software built by STScI, their JWST pipeline. But then we tweak it in some ways. Then the middle stages are data analysis. They’re not really science focused. It’s like how do you go from pixels to a spectrum.

And then stage five is where you get into a little bit of science. What does exoplanet transit look like? And there’s great open source packages that we’ve been using for that. And then stage six is just stapling together a hundred different wavelengths, just put them all into one section. So yeah, fairly minimal requirements for learning the science behind it.

[00:28:47] Arfon: Cool.

I think I’ve stared at bits of the JWST pipeline in the past, so maybe I’ll find some of that code in there too. So what’s the best way that people can stay up to date with your work, Taylor? It sounds like you had a Nature page recently, so congrats on that. But how can people find you online and, keep up to date with Eureka! and its latest releases?

What’s the best way to for folks to keep track?

[00:29:08] Taylor: Probably the best way is my personal website, www.taylorbell.ca. I have all my socials and stuff linked on there and I try to keep it up to date. There’s some blog posts of some science I’ve done and many of them are just placeholders where I’ll eventually write stuff, but yeah. That’s a good place to keep track on me and find my GitHub and things like that, too.

[00:29:32] Arfon: Cool. Thanks. And future releases of Eureka!? Is GitHub the right place to keep track of that as well?

Is that where you would release new versions?

[00:29:40] Taylor: Yeah, yeah, we just released version 0. 10 probably two weeks ago, and we’re working actively on version one right now where we’ll not really add any new features, but just make the maintenance a lot more sustainable on us and just reorganize things and do some backwards compatibility breaking to make things easier for users and developers.

[00:30:00] Abby: Nice. So that’s on GitHub. That’s taylorbell57, I believe. We’ll link all this in the show notes so people can go find those. But thank you so much, Taylor, for joining us. This is a great conversation.

[00:30:10] Taylor:

Thanks so much for having me.

[00:30:12] Abby: Of course.

[00:30:12] Arfon: Yeah, and congrats on a really nice piece of software. It looks really great. I wish you all the best and I hope, there are many future releases. it looks really great. So thanks again for telling us about your package.

[/expand]

Video

25 Jan, 2024
josscast

JOSSCast #1: Eva Maxfield Brown on Speakerbox – Open Source Speaker Identification for Political Science

Subscribe Now: Apple, Spotify, YouTube, RSS

In the first episode of Open Source for Researchers, hosts Arfon and Abby sit down with Eva Maxfield Brown to discuss Speakerbox, an open source speaker identification tool.

Originally part of the Council Data Project, Speakerbox was used to train models to identify city council members speaking in transcripts, starting with cities like Seattle. Speakerbox can run on your laptop, making this a cost-effective solution for many civic hackers, citizen scientists, and now .. podcasters!

From the advantages of fine-tuning pre-trianed models for personalized speaker identification to the concept of few-shot learning, Eva walks us through her solution. Want to know how this open source project came to life? Tune in to hear about Eva’s journey with Speakerbox and publishing in JOSS!

Transcript

[00:00:00] Arfon: Welcome to Open Source for Researchers, a podcast showcasing open source software built by researchers for researchers.

My name is Arfon.

[00:00:11] Abby: And I’m Abby.

[00:00:12] Arfon: And we’re your hosts. Every week, we interview an author who’s published in the Journal of Open Source Software. This week, we’re going to be talking to Eva Maxfield Brown about their software, Speakerbox, few shot learning for speaker identification with transformers.

[00:00:27] Abby: Yeah. And I think this was a great interview to kick off with. Eva was really excited to talk about her work. And I thought it was very applicable for podcasting!

[00:00:35] Arfon: Absolutely. And, I don’t think I said it at the start, but this is our first ever episode, so we’re really excited to start with this really interesting piece of software.

[00:00:44] Abby: Awesome, I guess we can just dive right in.

[00:00:46] Arfon: Yeah, let’s do it.

Welcome to the podcast, Eva.

[00:00:49] Eva: Hi, how are you?

[00:00:50] Arfon: Great. Thanks for coming on.

[expand]

[00:00:52] Abby: Yeah, glad to have you in one of our early episodes.

But do you want to tell us a little bit about yourself just to kick us off?

[00:00:58] Eva: Yeah, sure. My name is Eva Maxfield Brown. I’m a PhD student at the University of Washington Information School. My research primarily focuses on the science of scientific software more generally, but I also like to believe that I practice building scientific software. So I have experience building software for computational biology or microscopy and also political and information science.

The project that we’ll be talking about today falls into that political and information science capacity of building not just studying.

[00:01:27] Arfon: Awesome.

Tell us about Speakerbox. Tell us about the project. This is a paper that was reviewed in JOSS about a year ago now, a little bit over, or initially submitted. I’m just curious, if you could tell us a little bit about the project, why you started it, and what kinds of problems you’re trying to solve with Speakerbox.

[00:01:43] Eva: Yeah. I’ll have to back up for a little bit. Speakerbox is part of a larger ecosystem of tools that fall under this project that we were calling Council Data Project. This was a system that basically was like, Oh, what’s really hard for political scientists is studying local government and one of the reasons why it’s hard to study local government is most of the stuff you have to do is qualitative.

It’s very hard to get transcripts. It’s very hard to get audio. It’s very hard to get all these different things in a standardized format for all these different councils. So, Council Data Project started a long time ago and it tried to lay the groundwork of getting all those transcripts and systematizing it in a way.

But one of the longest requested features of that project was being able to say that each sentence in a transcript was said by who, like this sentence was said by Council Member A and this sentence was said by Council Member B and so on. And so there’s a problem here, right?

We could spend the time and resources to build and train individual speaker identification models for each city council. But that’s a lot of time and resources that we don’t really have access to both as researchers, but also just as people interested in contributing to the local government area.

And so Speakerbox really came out of that problem of how do you quickly annotate and train individual speaker identification models for application across a transcript, across a timestamped transcript.

[00:03:00] Abby: Yeah, reading this paper I got really excited as a nascent podcaster. So I could see how this is immediately applicable because I have this audio files, and can we identify who’s speaking with what. Anyways, I could see how this could be useful.

[00:03:14] Arfon: Yeah, maybe we could, maybe we should try it

[00:03:17] Abby: Yeah, we can. On this episode.

[00:03:19] Eva: There was, I think there was one of the early GitHub issues that we got after publishing the paper was from someone who was recording their research lab’s meetings and basically wanting to train a model just to tag their own lab members. And I was like, I guess that’s a use case. Sure. That makes sense.

[00:03:35] Abby: We can train it to tag ourselves as hosts identify a mysterious third person each episode.

[00:03:41] Arfon: Yeah, there you go

I was going to say, so, the Council Data Project, could you say a bit more about who the users are, who the target audience here is? Is that like studying government and how councils work? Who is your audience?

Who would use Speakerbox?

[00:03:55] Eva: Yeah. So that’s a really good question. So I would say the larger ecosystem of Council Data Project would fall into political science, especially the subdomain of urban or city level, municipal level political scholarship. Then there’s also information science questions.

And I think there’s a lot of really interesting information science questions, such as how do ideas and misinformation and disinformation enter city councils and so on. So I think those are two very core audiences of that research, but there’s also the public side.

So there’s just general users, right? Council Data Projects comes with a website and searchable index that you can search through all the meetings that we’ve transcribed and everything. So we do have general citizen users. We also have people at the cities that we support like city employees as users as well. And journalists.

So there’s a couple of different people that are interested in it from the Council Data Project angle. But specifically for Speakerbox, as we just talked about I think there’s a number of cases where this is just useful. So that was one of the reasons why we split it out as its own thing . Certain tools for Council Data Project will likely be hard coded towards our infrastructure or towards our technology stack. But, Speakerbox is just this general toolbox of quickly how to train a speaker identification model has a broader application set.

So things like podcasting, things still in the domain of journalism or in the domain of political science, sure, but also in music or in anything else that kind of deals with this very fast audio labeling type of dilemma.

[00:05:21] Abby: Yeah, another thing that really struck me reading the paper and also just hearing you talk about the Council Data Project plus Speakerbox. There’s a lot of open source and open data tools and resources that you’re using and you’re drawing from.

Did that really affect why you’re doing this openly? Can you just tell me about the whole ecosystem?

[00:05:38] Eva: Yeah. that’s a good question. I think from the start we positioned. Council Data Project in the genre or domain of the prior, past era of open data and civic collaboration type of, angle. We originally envisioned that a different group would be able to deploy all of our tools and technology for their city council, wherever they are.

That would just reduce the cost on us. And so that was partly the reason why moving to the open source model was really nice. All the documentation can be available if they need to make an edit for their own city, whatever it is, they can contribute back. There are also existing projects prior to us, in the history of this civic information landscape that did this pattern. But I think we’ve shifted it to our own needs a little bit. That said, the fast pace of machine learning and natural language processing tools is just so quick that you could expect a new version of some package came out and there’s going to be, a breaking change.

And to be honest, at this point, I work on a lot of different projects and it’s really nice to push it open source just because I know that there are other people, whether they’re related to Council Data Project or not, that are also going to use it and need it and can contribute to helping fix and maintain and sustain the project itself.

[00:06:47] Arfon: So are there really obvious alternatives? I understand why you would want to publish this. There’s lots of opportunity for pushing things into the public domain but are there tools that other people were using previously that you’ve found efficient or are maybe using I don’t know, proprietary languages or something.

What were the motivations for publishing a new tool specifically?

[00:07:07] Eva: I think there definitely are proprietary technologies that allow you to do this. I think maybe Google is probably the most famous. I think they have a whole suite of give us your data set and we will try to train a model for it.

Other companies as well. I think Microsoft probably has their own version of that. Amazon as well. So there’s definitely proprietary kind of versions of this. I think what I particularly wanted to do was: One, this is something that, again, focusing on specifically our core user base, which is that civic technology type of person, they don’t typically have access to a lot of money to pay Google or to pay Amazon or whatever but they might have access to their laptop and that has a decent CPU, or they might have, they might even have a desktop that has a GPU or something.

And so I wanted to make it possible where it’s like we can, quickly annotate, like we can demonstrate exactly how to annotate this data and also demonstrate using consumer grade GPU. I think the GPU that I used in the example documentation for all of this is five, six years old or something at this point.

And show that it still works totally fine and it trains in under an hour. Everything is able to be done and let’s say a weekend of time. And I think that was really needed. I’ve already heard from other people, whether it’s, other research labs or other people trying to deploy all of our Council Data Project infrastructure that they are able to quickly iterate and train these models just on their own local machine.

[00:08:33] Arfon: Yeah, that’s really cool. You mentioned civic technology. I think I heard you say civic hacking space before. I think that the idea of publishing tools for communities that don’t have tons of resources and don’t necessarily, aren’t able to pay for services is really important.

And I was fortunate to be in Chicago when the civic, hacking scene was really starting up in the late

[00:08:54] Eva: Oh,

[00:08:55] Arfon: 2000, Yeah, about 2009 2010, and that was a really vibrant community, and I think still is, and there’s just lots

[00:09:01] Eva: Yeah. Chi Hack Night is

[00:09:03] Arfon: Chi Hack Night, there you go, yeah, lots of open, civic tech is, it’s a really cool, space, yeah, cool, okay.

Okay, so I can run this tool on, a relatively modest GPU, but what’s actually happening under the hood?

So there’s this transformer thing, there’s some learning that happens, what, does it look like to actually train Speakerbox for your project, for your dataset?

[00:09:27] Eva: Yeah. fortunately for us, Transformers is I think two things. So when people say the word Transformers, it’s, a couple of things all in one. There’s Transformers the architecture, which I think we can maybe shove aside and say it’s the kind of quote foundation of many modern contemporary machine learning and natural language processing models.

It’s very good at what it does and it works based off of trying to look for patterns across the entire sequence, right? So looking for whole sequences of data and saying where are these, things similar? So that’s great. But there’s also Transformers, the larger Python package and ecosystem, which is run by a company called Hugging Face.

And that’s, I think, where, at least now, honestly, maybe most people think of oh, I’m using Transformers. Specifically in my case, and in the Speakerbox case, we built it off of Transformers because there are tons of existing models that we can use to make this process of speaker identification or speaker classification much, much, easier.

Specifically there are models already pre trained and available on I think it’s called the Vox Fox something data set. I forget what the name is. But the general idea was that, people have already trained a transformer model to identify something like 1000 different people’s voices.

And to my understanding, they’re all celebrities, right? So you can imagine this. training process as, okay, given a bunch of different audio samples of celebrities voices, can we identify which celebrities voices these are? In our case, we don’t care about the celebrities, right? For many, if you want to train a model for your lab or you want to train it for a city council meeting or whatever it is you care about the people in your meeting.

And so the process there is called fine tuning, where really the main task is you’re taking off that last layer of the model and instead of saying, I want you to pick between one out of a thousand celebrities, I want you to pick out of one out of in this call, I want you to pick out of one of three people, right?

And by doing so, you can quickly give it some extra samples of data. So we can give it a couple, let’s say 20 samples of Arfon speaking. We can give it 20 samples of Abby speaking. We can give it 20 samples of me speaking. And hopefully just because of the existing pre training it, it’s learned the basics of how people speak and how people, converse.

And all we’re really doing is saying. Don’t care about the basics of how people speak and how people converse. Really focus on the qualities of this person’s voice and the qualities of this person’s voice, and so on. Hopefully that answers the question.

[00:11:50] Arfon: is that why, is that, those examples, is that what we mean by the few shot learning? So you give a few examples, effectively, of labeled This is Abby speaking, or this is Eva speaking. is that what we mean by the few shot?

[00:12:03] Eva: Exactly. In our case we’re doing a little bit of a hack. We say few shot learning, but it’s actually not really because it’s very common for few shot learning to literally be five examples or something less than 10. But really because audio samples can become really long, like I’m answering this question in a one minute phrase or a one minute response we can split up that audio into many, smaller chunks that act as potentially better smaller examples.

And so you may give us five examples, but in the process of actually training the model in Speakerbox, we might split that up into many smaller ones. So we’re kind of cheating, but,

[00:12:38] Arfon: Interesting. Thanks.

[00:12:40] Abby: Yeah, that’s awesome. And I do really like the, the Hugging Face platform, where there’s so many models that you can tweak. Are you sharing back to the Hugging Face platform at all? Or,

[00:12:50] Eva: I haven’t, I’m not. We have used some of the models trained from Speakerbox in our own work with Council Data Project specifically for the Seattle City Council, because that’s, my home council. And we’ve trained a model to identify city council members for Seattle, and we’ve applied it across a number, I think, almost 300 transcripts.

But that was just purely for analysis. We haven’t pushed that model up or anything, and we probably should. That should probably even be part of the process or part of the pipeline. Yeah, to come. I’m sure other people who may have used the system might have or something, but,

[00:13:20] Abby: you did publish in JOSS, the Journal of Open Source Software. Can you talk a bit about why you decided to publish there, and how the experience was?

[00:13:27] Eva: Yeah. So to be honest, I love JOSS. This is my second article in JOSS. We’ve also published one for the larger ecosystem of Council Data Projects as well. I think after the first time publishing with JOSS, I was just really, really, impressed with the reviewing process. More so than some other journals, right?

I think the requirements of having reviewers try to use your package that you’re pushing, saying this is ready for use, or this is useful in X domain. I think having reviewers try to use it is such a It’s such a good thing to do simply because one, like we know that it’s available, we know it’s documented, but also we, know that it works and I think I’ve just been frustrated by the number of times I’ve seen a paper on arxiv or something that links to their code, but there’s no real documentation or something.

And so, to me, it’s like I don’t know, I, just, one, like to keep supporting what JOSS does because I really appreciate, I’ve, read a number of papers from JOSS where I’m like, that tool seems great, I’m going to go use it, right? Yeah, I don’t know, it just, it seems like the default place in my mind now, where if I want, to come off as like a trusted developer of software or a trusted developer of research software, that’s the place to go.

[00:14:35] Arfon: Yeah, and it’s funny you say having people actually use your software. This is one of the founding questions that a few of us were asking, which was, it was possible to publish a paper about software in some disciplines, but the question I was asking people on Twitter at the time, I’m not really very active there now, but was like, when you have managed to publish a paper about software, did anybody look at the software?

And people were like, nope. Universally, nobody ever said yes. And I was like, huh, that seems really weird to me, that you would publish software, but nobody would look at the software, that seems. I’m glad that resonates for you and provides value.

[00:15:17] Eva: I was gonna, I was gonna add, I think the open reviewer process is very, very, nice. Because at least when I was writing the first paper that we wrote for JOSS, I could go and look at, other review processes and say, okay, what are they expecting? What might we need to change ahead of time, etc. And I could just, prepare myself, especially as like a, now a PhD student, sometimes publishing is scary and being able to look at a review process drastically lowers that concern or that worry as well.

[00:15:47] Arfon: Yeah, definitely. I mean, it’s good and bad with open review. It’s good for transparency and people are generally very well behaved. It’s bad because it’s, I think, amongst other things, intimidating is probably, it could be exclusionary to some folks who don’t feel like they could participate in that kind of review.

I’m, but on balance, I feel like it’s the right decision for this journal. I definitely think it’s an important part of what JOSS does.

I was going to ask if there was anything it sounds like you’d eyeballed and seen some reviews before you’d submitted, but was there anything particularly important about the review for Speakerbox that was a particularly good contribution from the reviewers or anything that was surprising, , anything particularly memorable from that

review?

[00:16:33] Eva: Speakerbox , because it’s designed to act in a workflow manner, it’s not so much, a library of here’s a single function or here’s like a nice function, two or three functions to use for your processing. It’s very much designed as a workflow, right?

Where you say, I have some set of data. I need to annotate that data first and then I need to, prep it for training. And then I need to finally train the model. And I might need to evaluate as well. And because it’s laid out in that workflow kind of style. The reviewers were having trouble from a normal package where you say, I’m just going to go to the readme, look at a quick start, and use it.

And I think that, honestly, the biggest piece was just, the reviewers just kept saying, I’m a little confused as to how to actually use this. And ultimately I very quickly made a video demonstrating exactly how to use it and I think that was the best contribution was just going back and forth and saying okay, I see your confusion here.

I see your confusion here. And ultimately leading me to just be like, okay, I’m going to take everything that they just said, make a video and also take any of the things that I did in the video. And add them to the readme, right?

By literally forcing myself to do it on a toy data set, I can add all the little extra bits of places that were possibly confusing for them as well. Using my own data versus using some random other data, there’s always going to be, things that you forget to add to the readme or whatever it is.

[00:17:46] Abby: That’s awesome. And one thing I really, I do like about the JOSS Review System is just how it makes, it just adds so many best practices for open source and makes it so much easier for others to, to run it and to contribute. So is Speakerbox open for a contribution if people wanted to jump in? Can they?

[00:18:02] Eva: Yeah, totally. Speakerbox. Yes, I think there are some existing minor, little bugs, I think available. There’s also probably tons of optimizations or just UX improvements that could definitely happen. Yes, totally open for contribution, totally happy to review PRs or respond to bugs and stuff.

[00:18:20] Abby: That’s great, especially since you published this a while ago, that you’re still being responsive there on GitHub. But what skills do people need if they want to jump in?

[00:18:28] Eva: I think the primary thing, there’s maybe two areas, one is, I think a very good background in let’s say data processing and machine learning in Python. Like those two things together as a single trait might be nice. Or on the opposite end of things, if you just want to try and train your own, like just use the system as is. You might experience bugs or something and just making a documentation post, right? like say "I experienced this bug, here’s how I got around it". Or, posting a GitHub issue and saying, there’s this current bug for someone else to gather. I think truly trying to fix stuff if you have a machine learning background or just trying to build a model are both totally welcome things.

[00:19:06] Abby: That’s awesome. Yeah, and I love how you’re seeing like using the model as a contribution, and just like documenting what you run into.

[00:19:12] Arfon: So I was gonna I’m gonna put my product manager hat on for a minute and ask you what questions should we have asked you today that we haven’t done yet?

[00:19:21] Eva: Woo.

[00:19:21] Arfon: Okay if there isn’t a question, but it’s the magic wand question for, what, could this product do if we if it could do anything? What should we have asked you as hosts that we didn’t, that we didn’t check in with you about today?

[00:19:35] Eva: I think it’s not so much a question, maybe what’s like the feature that you dream of,

[00:19:39] Arfon: Yeah,

[00:19:40] Eva: if, it’s true, if it’s truly like a product manager hat on then I think the feature we’ve previously talked a lot about okay, the, workflow itself is pretty simple and there are pretty good packages nowadays for shipping a webpage via Python. And it’d be very nice to just have the entire workflow as a user, like as a nice user experience process, like you you launch the server on your terminal and then you can do everything in the webpage especially for the users that aren’t as comfortable in the terminal, right?

That would be very, very, nice. And something that I just never got around to. it’s not my priority. But something that we’ve thought about as it would definitely reduce the barrier of use or the barrier of effort or whatever.

[00:20:21] Abby: Yeah. Oh, that’s really cool. Especially in the civic tech space, I can see that being like really game changing for a lot of cities.

[00:20:28] Arfon: Yeah.

[00:20:28] Eva: Yeah. I think civics and journalists as well, I think. But yeah.

[00:20:32] Arfon: Yeah. Very cool. Cool.

Hey, Eva, this has been a great, conversation. Thanks for telling us about Speakerbox. Just to close this out, I was curious how can people follow your work, keep up to date with what you’re working on today? Is there any place you would want people to go to keep track of what you’re working on?

[00:20:49] Eva: Yeah my website is evamaxfield.github.io. I will occasionally post updates on papers and stuff there. I think Twitter might be the best place you can find me on Twitter at @EvaMaxfieldB as well. So

[00:21:02] Arfon: Fantastic. thanks for being part of the podcast. It’s been really fun to learn about your software and thanks for your time today.

[00:21:09] Eva: yeah, thanks for having me.

[00:21:10] Abby: Thank you so much for listening to Open Source for Researchers. We love to showcase open source software built by researchers for researchers, so you can hear more by subscribing in your favorite podcast app. Open Source for Researchers is produced and hosted by Arfon Smith and me, Abby Cabunoc Mayes, edited by Abby, and the music is CC-BY Boxcat Games.

[/expand]

Video

25 Jan, 2024
announcements josscast

Introducing JOSSCast: Open Source for Researchers 🎉

Subscribe Now: Apple, Spotify, YouTube, RSS

We’re thrilled to announce the launch of “JOSSCast: Open Source for Researchers” - a podcast exploring new ways open source can accelerate your work. Hosted by Arfon Smith and Abby Cabunoc Mayes, each episode features an interview with different authors of published papers in JOSS.

There are 3 episodes available for you to listen to today! This includes “#1: Eva Maxfield Brown on Speakerbox – Open Source Speaker Identification for Political Science” and “#2: Astronomy in the Open – Dr. Taylor James Bell on Eureka!” along with a special episode #0 with hosts Arfon and Abby.

Tune in to learn about the latest developments in research software engineering, open science, and how they’re changing the way research is conducted.

New episodes every other Thursday.

Subscribe Now: Apple, Spotify, YouTube, RSS

9 Oct, 2023
announcements

Call for editors

Once again, we’re looking to grow our editorial team at JOSS!

Since our launch in May 2016, our existing editorial team has handled nearly 3000 submissions (2182 published at the time of writing, 265 under review) and the demand from the community continues to be strong. JOSS now consistently publishes a little over one paper per day, and we see no sign of this demand dropping.

New editors at JOSS are asked to make a minimum 1-year commitment, with additional years possible by mutual consent. As some of our existing editorial team are reaching the end of their term with JOSS, the time is right to bring on another cohort of editors.

Background on JOSS

If you think you might be interested, take a look at our editorial guide, which describes the editorial workflow at JOSS, and also some of the reviews for recently accepted papers. Between these two, you should be able to get a good overview of what editing for JOSS looks like.

Further background about JOSS can be found in our PeerJ CS paper, which summarizes our first year, and our Editor-in-Chief’s original blog post, which announced the journal and describes some of our core motivations for starting the journal.

More recently we’ve also written in detail about the costs related with running JOSS, scaling our editorial processes, and talked about the collaborative peer review that JOSS promotes.

How to apply

Firstly, we especially welcome applications from prospective editors who will contribute to the diversity (e.g., ethnic, gender, disciplinary, and geographical) of our board.

✨✨✨ If you’re interested in applying please fill in this short form by 6th November 2023. ✨✨✨

Who can apply

Applicants should have significant experience in open source software and software engineering, as well as expertise in at least one subject/disciplinary area. Prior experience with peer-review and open science practices are also beneficial.

We are seeking new editors across all research disciplines. The JOSS editorial team has a diverse background and there is no requirement for JOSS editors to be working in academia.

Selection process

The JOSS editorial team will review your applications and make their recommendations. Highly-ranked candidates will then have a short (~30 minute) phone call/video conference interview with the editor(s)-in-chief. Successful candidates will then join the JOSS editorial team for a probational period of 3 months before becoming a full member of the editorial team. You will get an onboarding “buddy” from the experienced editors to help you out during that time.

Dan Foreman-Mackey, Olivia Guest, Daniel S. Katz, Kevin M. Moerman, Kyle E. Niemeyer, Arfon M. Smith, Kristen Thyng

JOSS publishes 2000th paper

Arfon M. Smith

This week JOSS reached a big milestone – publishing our 2000th paper! It also happens to be our 7th birthday, and we thought we’d take this opportunity to review our submission stats from the last few years, discuss some of the changes to JOSS we’ve made of late, and reflect on some of the challenges we have faced as a journal.

Submission summary

Everything discussed here is derived from the amazing work of one of our editors (thanks Charlotte!) who created our submission analytics page which is built nightly based on data from the JOSS API. If you want to dig more into this analysis, the source code is available for you to do so.

High level publication stats

2000 papers over 7 years means, on average, we’ve published 285 papers per year. An average isn’t quite right though of course, with our throughput being substantially lower in the early days.

Year 1 (May 2016 – May 2017): 57
Year 2 (May 2017 – May 2018): 138
Year 3 (May 2018 – May 2019): 254
Year 4 (May 2019 – May 2020): 345
Year 5 (May 2020 – May 2021): 338
Year 6 (May 2021 – May 2022): 369
Year 7 (May 2022 – May 2023): 366

Note that JOSS closed for submissions between March 2020 and May 2020 due to the COVID-19 pandemic. This likely accounts for the drop in publications in year 5 (May 2020 – May 2021). Looking at the high-level breakdown, it seems like we’ve reached some kind of plateau of around one paper published per day. If we were a business looking to grow revenue this lack of year over year growth might be of concern. However, as a volunteer-run journal, this is OK with most of us :-)

Submission scope and rejections

In July 2020 we introduced a test for ‘substantial scholarly effort’ for all new submissions. You can read more about the motivations for this in our blog post but this clearly had an effect on our rejection rate both in the pre-review stage, and during/post review.

Screenshot_2023-05-03_at_11_27_29

We now reject between 15-30% papers at the pre-review stage, and around 5% during the actual review process itself (note this is in line with our goal to provide a constructive review process, improving submissions such that they can be accepted).

Authorship

JOSS is no longer dominated by single author submissions. In fact, we are seeing some evidence of more authors per submission with the fraction of submissions with more than 5 authors now approaching 25%.

Screenshot 2023-05-03 at 11 28 53

Authors

Citations

More than 200 papers have been cited > 20 times. About a third have never been cited. A few papers have been cited a lot: Welcome to the Tidyverse currently has nearly 6000 citations.

Papers in more than 6000 different venues have cited a JOSS paper, with the most common being bioRxiv, JOSS, Scientific Reports, Monthly Notices of the Royal Astronomical Society, and The Astrophysical Journal.

Editor and Reviewer statistics

2093 unique individuals have contributed reviews for the 2000 papers published in JOSS, including 8 amazing individuals who have contributed more than 10 reviews each!

JOSS currently has 77 editors, 6 of whom are track Editors-in-Chief, and one Editor-inChief. 112 editors have served in total on the editorial board.

Unfortunately, our reviews are getting slower. We’re not really sure why, but this has been a noticeable change from our earlier days. Our average time now spent in review is approaching four months, whereas pre-COVID it was under three.

Software statistics

JOSS reviews are primarily about the software, and so it would be remiss of us not to talk about that. Python is still the #1 language for JOSS submissions, used in part for well over half of published papers (~1200 out of 2000). R is #2 at 445 submissions, and C++ #3 (although of course C++ and C may be used together with another language).

Anecdotally we’re seeing an increase in the number of authors submitting to JOSS to declare their submission ‘ready for use’ or declaring as a v1.0 release. This is also potentially supported by the peak close to 0 days between the repository creation date and the submission to JOSS.

Screenshot 2023-05-04 at 10 29 00

The MIT, GPL (v3), and BSD 3-Clause licences are still used for the majority of submissions.

Screenshot 2023-05-03 at 11 42 01

Investments in tooling and infrastructure

JOSS has benefited from the kind support of the Alfred P. Sloan Foundation, the Gordon and Betty Moore Foundation, and a number of small development grants from NumFOCUS. This support has enabled JOSS to invest further in our infrastructure and tooling, highlights of which we describe below:

Editorialbot for all of your GitHub-based review needs

Our former bot Whedon has now become Editorialbot, which is much more than a rename. Editorialbot is now its own open source framework that can be used to manage review-like interactions on GitHub (i.e., not just publishing workflows). Currently, Editorialbot is used by rOpenSci, JOSS, SciPy Proceedings, with more coming soon. Thanks to Juanjo Bazán for all of his great work here!

As part of this work we’ve also extracted all of the legacy capabilities from the Whedon codebase to run as a series of GitHub Actions workflows.

Investments in our Pandoc-based publishing pipeline

JOSS has always used Pandoc to produce our PDF papers and Crossref XML metadata, but the way we used it was… hacky. Over the past few years we’ve been fortunate to work directly with one of the Pandoc core team (Albert Krewinkel) to implement a number of improvements to how we use Pandoc within the JOSS publishing system, and also to contribute new capabilities to Pandoc itself, including improved ConTeXt support, support for JATS outputs, and more sophisticated handling of author names in the Pandoc frontmatter.

Editorial tooling improvements

Last but not least, we’ve also made a bunch of small (and not so small) changes to the way that we handle submissions as an editorial team, including implementing tracks (and appointing a track EiC for each) and a reviewer management tool for searching for appropriate reviewers and tracking reviewer assignments.

Thank you!

Over these seven years of operations, JOSS simply wouldn’t work without the dedicated volunteer editors and reviewers (and of course the authors submitting their work).

References

Höllig J, Kulbach C, Thoma S. TSInterpret: A Python Package for the Interpretability of Time Series Classification. Journal of Open Source Software. 2023;8(85):5220. doi:10.21105/joss.05220

Smith AM. Minimum publishable unit. Published online July 7, 2020. doi:10.59349/dfy0f-3y061

Wickham H, Averick M, Bryan J, et al. Welcome to the Tidyverse. Journal of Open Source Software. 2019;4(43):1686. doi:10.21105/joss.01686

Call for editors

Dan Foreman-Mackey, Olivia Guest, Daniel S. Katz, Kevin M. Moerman, Kyle Niemeyer, Arfon M. Smith, George K. Thiruvathukal, Kristen Thyng

Once again, we’re looking to grow our editorial team at JOSS!

Since our launch in May 2016, our existing editorial team has handled over 2000 submissions (1838 published at the time of writing, 215 under review) and the demand from the community continues to be strong. JOSS now consistently publishes a little over one paper per day, and we see no sign of this demand dropping.

Background on JOSS

More recently we’ve also written in detail about the costs related with running JOSS, scaling our editorial processes, and talked about the collaborative peer review that JOSS promotes.

How to apply

Firstly, we especially welcome applications from prospective editors who will contribute to the diversity (ethnic, gender, disciplinary, and geographical) of our board.

✨✨✨ If you’re interested in applying please fill in this short form by 22nd December 2022. ✨✨✨

Who can apply

We welcome applications from potential editors with significant experience in one or more of the following areas: open source software, open science, software engineering, peer-review.

The JOSS editorial team has a diverse background and there is no requirement for JOSS editors to be working in academia. Unfortunately individuals enrolled in a PhD program are not eligible to serve on the JOSS editorial team.

Selection process

References

Smith AM, Niemeyer KE, Katz DS, et al. Journal of Open Source Software (JOSS): design and first-year review. PeerJ Computer Science. 2018;4:e147. doi:10.7717/peerj-cs.147

Katz DS, Barba LA, Niemeyer K, Smith AM. Cost models for running an online open journal. Published online June 4, 2019. doi:10.59349/g4fz2-1cr36

Katz DS, Barba LA, Niemeyer K, Smith AM. Scaling the Journal of Open Source Software (JOSS). Published online July 8, 2019. doi:10.59349/gsrcb-qsd74

Call for editors: Astronomy & Astrophysics

Dan Foreman-Mackey

JOSS is continuing to grow, and we are looking to add more editors with expertise in the area of astronomy & astrophysics.

Since our launch in May 2016, our existing editorial team has handled nearly 1900 submissions (1684 published at the time of writing, 205 under review) and the demand from the community continues to grow. In particular, we have seen an increase in the number of astronomy & astrophysics submissions, beyond the capacity of our current editorial team.

Editors at JOSS make a minimum 1-year commitment, with additional years possible by mutual consent. With some of our existing editorial team reaching the end of their terms with JOSS and this increase in submissions, the time is right to bring new editors on board.

Background on JOSS

More recently we’ve also written in detail about our commitment to the Principles of Open Scholarly Infrastructure the costs related with running JOSS, scaling our editorial processes, and talked about the collaborative peer review that JOSS promotes.

Of specific interest to this call, we also have a collaboration with the American Astronomical Society (AAS) Journals to provide a parallel review for submissions with a significant software component.

Who can apply

We welcome applications from potential editors with research experience in astronomy and astrophysics, including but not limited to, open source software development for astrophysical simulations, data reduction, or statistical methods.

Members of the JOSS editorial team have diverse backgrounds, and we welcome JOSS editors from academia, government, and industry. We especially welcome applications from prospective editors who will contribute to the diversity (ethnic, gender, disciplinary, and geographical) of our board. We also value having a range of junior and senior editors.

How to apply

✨✨✨ To apply please fill in this short form by 31 July 2022. ✨✨✨

Selection process

The JOSS editorial team will review your applications and make recommendations. Highly-ranked candidates will then have a short (~30 minute) phone call/video conference interview with a current editor. Successful candidates will then join the JOSS editorial team for a probationary period of 3 months before becoming a full member of the editorial team. You will get an onboarding “buddy” from the experienced editors to help you out during that time.

References

Smith A. Announcing The Journal of Open Source Software - Arfon Smith. Published online May 5, 2016. Accessed July 12, 2022. https://www.arfon.org/announcing-the-journal-of-open-source-software

Smith AM, Niemeyer KE, Katz DS, et al. Journal of Open Source Software (JOSS): design and first-year review. PeerJ Computer Science. 2018;4:e147. doi:10.7717/peerj-cs.147

Katz DS, Smith AM, Niemeyer K, Huff K, Barba LA. JOSS’s Commitment to the Principles of Open Scholarly Infrastructure. Published online February 14, 2021. doi:10.59349/m5h23-pjs71

Katz DS, Barba LA, Niemeyer K, Smith AM. Cost models for running an online open journal. Published online June 4, 2019. doi:10.59349/g4fz2-1cr36

Katz DS, Barba LA, Niemeyer K, Smith AM. Scaling the Journal of Open Source Software (JOSS). Published online July 8, 2019. doi:10.59349/gsrcb-qsd74

Smith AM. A new collaboration with AAS publishing. Published online December 19, 2018. doi:10.59349/wj1gg-tsg49

Call for editors

Arfon M. Smith

JOSS is continuing to grow, and we are looking to add more editors again. We’re especially interested in recruiting editors with expertise in bioinformatics, neuroinformatics/neuroimaging, material science, ecology, machine learning & data science, and the social sciences.

Since our launch in May 2016, our existing editorial team has handled over 1800 submissions (1200 published at the time of writing, 170 under review) and the demand from the community continues to grow. The last three months have been our busiest yet, with JOSS publishing more than one paper per day, and we see no sign of this demand dropping.

Editing for a journal during a pandemic

After a pause on submissions in early 2020, JOSS has been open for submissions during most of the pandemic. We recognize that making time for volunteer commitments such as JOSS is especially challenging at this time and are taking steps to reduce the load on authors, editors, and reviewers and continually striving to find the right balance between accommodating the very real challenges many of us now face in our daily lives, and providing a service to the research software community.

Editing for JOSS is a regular task, but not one that takes a huge amount of time. JOSS editors are most effective if they are able to check in on their submissions a couple of times per week. Our goal is that a JOSS editor handles about three submissions at any one time making for about 25 submissions per year.

Background on JOSS

Who can apply

We welcome applications from potential editors with significant experience in one or more of the following areas: open source software, open science, software engineering, peer-review, noting again that editors with expertise in bioinformatics, neuroinformatics/neuroimaging, material science, ecology, machine learning & data science, and the social sciences are most needed.

Members of the JOSS editorial team have diverse backgrounds and we welcome JOSS editors from academia, government, and industry. We especially welcome applications from prospective editors who will contribute to the diversity (ethnic, gender, disciplinary, and geographical) of our board. We also value having a range of junior and senior editors.

How to apply

✨✨✨ To apply please fill in this short form by 23 April 2021. ✨✨✨

Selection process

The JOSS editorial team will review your applications and make recommendations. Highly-ranked candidates will then have a short (~30 minute) phone call/video conference interview with the editor(s)-in-chief. Successful candidates will then join the JOSS editorial team for a probationary period of 3 months before becoming a full member of the editorial team. You will get an onboarding “buddy” from the experienced editors to help you out during that time.

Thanks to our editors who are stepping down

A few of our editors are completing terms and stepping down from editorial duties at JOSS. Lorena A Barba (@labarba), Kathryn Huff (@katyhuff), Karthik Ram (@karthik), and Bruce E. Wilson (@usethedata) have been amazing editors to have on the team and we will miss them very much!

References

Smith A. Announcing The Journal of Open Source Software - Arfon Smith. Published online May 5, 2016. Accessed July 12, 2022. https://www.arfon.org/announcing-the-journal-of-open-source-software

Smith AM. Reopening JOSS. Published online May 18, 2020. doi:10.59349/4tz9w-yq369

Smith AM, Niemeyer KE, Katz DS, et al. Journal of Open Source Software (JOSS): design and first-year review. PeerJ Computer Science. 2018;4:e147. doi:10.7717/peerj-cs.147

Katz DS, Smith AM, Niemeyer K, Huff K, Barba LA. JOSS’s Commitment to the Principles of Open Scholarly Infrastructure. Published online February 14, 2021. doi:10.59349/m5h23-pjs71

Katz DS, Barba LA, Niemeyer K, Smith AM. Cost models for running an online open journal. Published online June 4, 2019. doi:10.59349/g4fz2-1cr36

Katz DS, Barba LA, Niemeyer K, Smith AM. Scaling the Journal of Open Source Software (JOSS). Published online July 8, 2019. doi:10.59349/gsrcb-qsd74

JOSS's Commitment to the Principles of Open Scholarly Infrastructure

Daniel S. Katz, Arfon M. Smith, Kyle Niemeyer, Kathryn Huff, Lorena A. Barba

The Journal of Open Source Software (JOSS) is committed to the Principles of Open Scholarly Infrastructure, and here we summarize our status in doing so, followed by a more detailed discussion of how we do so, as well as explaining some when we do not, and some work in progress.

This document was assembled by Daniel S. Katz, Arfon Smith, Kyle E. Niemeyer, Kathryn D. Huff, and Lorena A. Barba, and reviewed and approved by the active JOSS editorial board and topic editors, and the Open Journals Steering Council.

Summary

Governance
💛 Coverage across the research enterprise
💛 Stakeholder Governed
💚 Non-discriminatory membership
💚 Transparent operations
💚 Cannot lobby
💛 Living will
💚 Formal incentives to fulfil mission & wind-down

Sustainability
💚 Time-limited funds are used only for time-limited activities
💛 Goal to generate surplus
💛 Goal to create contingency fund to support operations for 12 months
💚 Mission-consistent revenue generation
💚 Revenue based on services, not data

Insurance
💚 Open source
💚 Open data (within constraints of privacy laws)
💚 Available data (within constraints of privacy laws)
💚 Patent non-assertion

(💚 = good, 💛 = less good)

Discussion

Governance

💛 Coverage across the research enterprise

It is increasingly clear that research transcends disciplines, geography, institutions and stakeholders. The infrastructure that supports it needs to do the same.

Research software is essential to all types of research, and in response, JOSS’s coverage includes research software in any discipline, from any place, and from any institution.

The scope of what we publish is only limited by a few guidelines. JOSS publications must be research software of sufficient scholarly effort, which means that some software essential to research is excluded because it is either not research software, (e.g., a C compiler) or too small (e.g., a few hundred lines of Python that implement an existing tool or that provides a wrapper to access a fine-grained data source.)

JOSS strives for broad coverage of research disciplines in its editorial board, to better serve a broad community of authors.

💛 Stakeholder Governed

A board-governed organisation drawn from the stakeholder community builds more confidence that the organisation will take decisions driven by community consensus and consideration of different interests.

Open Journals is fiscally sponsored by NumFOCUS, and has a documented governance structure. The steering council, being a small group, is limited in its representation, in terms of geographic, ethnic, gender, and organizational diversity. The editorial board members mostly represent North America and Europe, are mostly white, are mostly male, and are mostly hands-on researchers, primarily from universities and national laboratories.

💚 Non-discriminatory membership

We see the best option as an “opt-in” approach with a principle of non-discrimination where any stakeholder group may express an interest and should be welcome. The process of representation in day to day governance must also be inclusive with governance that reflects the demographics of the membership.

Additions to the editorial board, which is the first layer of governance, are made via selections from responses to open calls and are non-discriminatory.

💚 Transparent operations

Achieving trust in the selection of representatives to governance groups will be best achieved through transparent processes and operations in general (within the constraints of privacy laws).

JOSS is publicly transparent, much more so than most other journals. Few issues are not publicly open, but these are generally open to all editors, including initial discussions about potential changes to the journal and discussions about the scope of submissions that may or may not be accepted for review. Even though some of these discussions may not occur in the open, we do publish the decisions themselves, along with an explanation.

💚 Cannot lobby

The community, not infrastructure organisations, should collectively drive regulatory change. An infrastructure organisation’s role is to provide a base for others to work on and should depend on its community to support the creation of a legislative environment that affects it.

The vast majority of time and effort used at JOSS is operational, with improvements of JOSS next, and finally some publicity about JOSS, which could be considered lobbying. However, this lobbying is mostly done by the JOSS editors because they think it is important to them, not to JOSS. The fact that they are also JOSS editors is a consequence of their feelings about the importance of recognizing contributions to research software, which also leads them to talk about this with others.

💛 Living will

A powerful way to create trust is to publicly describe a plan addressing the condition under which an organisation would be wound down, how this would happen, and how any ongoing assets could be archived and preserved when passed to a successor organisation. Any such organisation would need to honour this same set of principles.

As discussed below, there are circumstances in which we would consider the mission of JOSS fulfilled and the journal no-longer necessary. While we have not documented a plan for winding down JOSS, we believe that the core assets associated with the journal (software, article metadata, papers) are appropriately preserved as part of our ongoing operations. The articles published in JOSS are persistently archived such that an end of the journal will not affect the scholarly record.

💚 Formal incentives to fulfil mission & wind-down

Infrastructures exist for a specific purpose and that purpose can be radically simplified or even rendered unnecessary by technological or social change. If it is possible the organisation (and staff) should have direct incentives to deliver on the mission and wind down.

JOSS views itself as a temporary solution to provide a means for software developers and maintainers to receive credit for their work, and to have this work (research software) improved by the process of open peer review. We look forward to a time when software papers are not needed, when software is directly recognized and cited, and when software peer review (potentially using a future version of our criteria and processes) is more widespread. JOSS is volunteer-run as a service to the community, and most of the volunteers will be happy when a solution like JOSS is no longer needed, because software has found a more direct avenue to be valued and counted in the scholarly record.

Sustainability

💚 Time-limited funds are used only for time-limited activities

Day to day operations should be supported by day to day sustainable revenue sources. Grant dependency for funding operations makes them fragile and more easily distracted from building core infrastructure.

JOSS does not depend on grants for regular operations, but has attracted grant funding for specific activities to improve the tooling that facilitates running JOSS. We make effective use of time-limited funds such as grants to support enhancements to our services.

💛 Goal to generate surplus

Organisations which define sustainability based merely on recovering costs are brittle and stagnant. It is not enough to merely survive, it has to be able to adapt and change. To weather economic, social and technological volatility, they need financial resources beyond immediate operating costs.

As described in our blog post on the topic, our operational costs are deliberately very low. We currently do not generate a surplus and have no plans to. We also do not employ any staff and so “economic, social and technological volatility” would be expected to have limited impact on JOSS.

💛 Goal to create contingency fund to support operations for 12 months

A high priority should be generating a contingency fund that can support a complete, orderly wind down (12 months in most cases). This fund should be separate from those allocated to covering operating risk and investment in development.

We have sufficient funds available today to support our operations for substantially longer than 12 months. Allocating some of these to a formal contingency fund is something we are considering but have not yet done. As a fiscally sponsored project of NumFOCUS, JOSS can receive donations from individuals, and we can kick off a fund-raising campaign at short notice. JOSS can also apply for NumFOCUS Small Development Grants, which are awarded several times per year.

💚 Mission-consistent revenue generation

Potential revenue sources should be considered for consistency with the organisational mission and not run counter to the aims of the organisation. For instance…

JOSS revenue comes from three sources: a small amount from donations, a small amount from the American Astronomical Society (AAS) as fees for reviews of the software linked to AAS publications, and a larger amount from grants to the journal related to demonstrating its effectiveness, promoting the importance of research software, and recognizing research software’s contributors. These sources of revenue are fully consistent with the mission of JOSS.

💚 Revenue based on services, not data

Data related to the running of the research enterprise should be a community property. Appropriate revenue sources might include value-added services, consulting, API Service Level Agreements or membership fees.

JOSS receives no revenue for its data, which is completely open, but rather receives revenue for its services and for its community impact.

Insurance

💚 Open source

All software required to run the infrastructure should be available under an open source license. This does not include other software that may be involved with running the organisation.

All of JOSS’ tools are open source and available on GitHub under the Open Journals organization. This includes the JOSS website, our editorial bot Whedon, and the document production toolchain. Some of the collaboration tools we use as an editorial team are not open (e.g., GitHub, Slack, Google Docs), but these are not critical to the functioning of the journal and could be replaced by open alternatives.

💚 Open data (within constraints of privacy laws)

For an infrastructure to be forked it will be necessary to replicate all relevant data. The CC0 waiver is best practice in making data legally available. Privacy and data protection laws will limit the extent to which this is possible

Our papers and the (Crossref DOI) metadata associated with them are available on GitHub, with an open license. We deposit open citations with Crossref, and archive papers and our reviews with Portico.

💚 Available data (within constraints of privacy laws)

It is not enough that the data be made “open” if there is not a practical way to actually obtain it. Underlying data should be made easily available via periodic data dumps.

Our papers and the (Crossref DOI) metadata associated with them are available on GitHub, with an open license. These data are easily accessible to all motivated to make use of them.

We could potentially create data exports of the JOSS web application database; however this would just be an alternative representation of the data already available.

💚 Patent non-assertion

The organisation should commit to a patent non-assertion covenant. The organisation may obtain patents to protect its own operations, but not use them to prevent the community from replicating the infrastructure.

JOSS has no interest in patents, other than resisting the creation of patents that might prevent us from operating freely.

References

POSI. The Principles of Open Scholarly Infrastructure. The Principles of Open Scholarly Infrastructure. Accessed February 21, 2021. https://openscholarlyinfrastructure.org/

Episode Highlights

Links

Transcript

Video

Links

Transcript

Video

Links

Transcript

Video

Background on JOSS

How to apply

Who can apply

Selection process

Submission summary

Editor and Reviewer statistics

Software statistics

Investments in tooling and infrastructure

References

Background on JOSS

How to apply

Who can apply

Selection process

References

Background on JOSS

Who can apply

How to apply

Selection process

References

Editing for a journal during a pandemic

Background on JOSS

Who can apply

How to apply

Selection process

Thanks to our editors who are stepping down

References

Summary

Discussion

Governance

Sustainability

Insurance

References