
Exploratory Data Analysis Made Easy At The Command Line - Episode 230

By Saul Pwanson

This is an edited transcript of an interview about VisiData from this episode of Podcast.__init__.
It was carefully transcribed and edited by E (@cel10e on twitter) on 2019-10-30.

Enjoy!

Table of Contents

Tobias' intro [00:00]

[Intro music plays.]
Tobias Macey: Hello, and welcome to Podcast.init, the podcast about Python and the people who make it great.

When you're ready to launch your next app or want to try a project you hear about on the show, you'll need somewhere to deploy it. So take a look at our friends over at LINODE. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you get everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models and running your continuous integration, they just launched dedicated CPU instances. Go to pythonpodcast.com/linode — that's L-I-N-O-D-E — today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show.

And you listen to this show to learn and stay up to date with the ways that Python is being used, including the latest in machine learning and data analysis. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, the combined events of the Data Architecture Summit and Graph Forum and Data Council in Barcelona. Go to pythonpodcast.com/conferences today to learn more about these and other events, and take advantage of our partner discounts when you register.

Saul intro [01:37]

Tobias Macey: Your host as usual is Tobias Macey, and today I'm interviewing Saul Pwanson about VisiData, a terminal-oriented interactive multi-tool for tabular data. So, Saul, can you start by introducing yourself?

Saul Pwanson: Yeah, my name is Saul, I've been working in the software industry for a while, and I'm in Seattle.

Tobias Macey: And do you remember how you first got introduced to Python?

Saul Pwanson: Yeah, it was for work back in 2004. I was at a startup which was using it for their apps, and I picked it up there, because that's just what you do, right?

Tobias Macey: And, so, at some point, you decided that you needed to start building your own multi-tool for working with data, particularly in tabular formats, and I'm wondering if you could just start by describing a bit about the tool that you built in the form of VisiData, and some of the backstory of how it got started?

Saul Pwanson: Yeah. I was working at a company called F5 Networks in 2011, I think, and I built up an early prototype of VisiData for them. I didn't know it would become VisiData at the time, it was just a configuration and exploration tool for their own networking hardware. But, as I got used to using it, it was fun, I found that the concept was very flexible, and I kept adding more and more stuff to it.

And then after I left F5, I found myself missing it. I kept wanting to use it to look at HDF5 files or other tabular data that I had. And so, when I was at another company in 2016, and that company was winding down, I realized that I was turning 40, and looking back over my career of almost 20 years, I didn't have very much to show for it. And I was like, "well, I've been wanting this tool, so let me just start that again." I couldn't use the code that I had written at F5, obviously. So I was like, "if I'm gonna rewrite it, let me do it again from scratch, and I'll make it right once and for all as the tool that I want to use." And so that was the genesis of VisiData.

Tobias Macey: There are a couple interesting things in there. One is the fact that you had to try and rewrite this tool from scratch, largely from memory, without copying the exact code that you had written. And then also, you mentioned HDF5 files, and I'm curious: have you experimented with PyTables at all, or any of the other libraries for that particular data format, and what was it that you found lacking that you wanted to have present in VisiData?

Saul Pwanson: So I hadn't used PyTables, but I've used a lot of other Python tools for working with HDF5 files. And just like with any complicated format, you can do anything you want with Python, it's really great for that. But it's still kind of a hassle, because you still have to write code to do it. And often, I wanted to just open up an HDF5 file and check whether that data reference is all zeros, or whether it has the tables I want in there. And the tools that I had weren't good. There was one that was written in Java, and so of course you'd start it up and it would take like three seconds to start. It actually had a bug that would modify the data, truncating it, even though it was supposed to be read-only. And so every kind of tool for quick exploration didn't really work for that purpose. I was really missing that really rapid flow: just show me the data, I want to see it with my own eyes and then pop back out.

Tobias Macey: And then for the sake of you rewriting this tool from scratch, I'm curious if there was any functionality in the original tool that you either consciously left out because it didn't suit your particular needs of the time, or any functionality that you tried to replicate but weren't quite able to match, because you didn't have the necessary context that you had had at the time when you were working at F5?

Saul Pwanson: Yeah, that's a really great question. I've been working on VisiData now for a lot longer than I worked on the original prototype that I had made. But because it was a networking company, one of the things that was in the original version that I still haven't added but would like to add someday, was the ability to add a derivative column, because it was updating live from actual data on the device. For example, if you have a bytes-transferred column, then you could add another column that would be its derivative: bytes transferred per second. And you could do that for any column you wanted to, which was a super handy feature.

But I am finding that VisiData is used more for static data now than the original tool was at F5, and so that feature hasn't been added. I do remember that the prototype had pop-ups — for example, if you wanted to set a field to one of ten values in an enumeration, there was a little modal pop-up that would show up, where you could scroll down and pick the right option. That was always a neat little feature, it made things a little easier to use. Right now in VisiData, if you have to add an aggregator, you look on the bottom line for the list, which I still don't quite like. But I decided not to add modal dialogues to VisiData. If you want to see something, you have to go to a fresh sheet; otherwise, the modality is just in the bottom status line. That was a conscious design decision. It looks flashier to have the pop-ups, but as I'm doing more design work, I see that modal dialogues kind of get in the way.

Tobias Macey: You mentioned that the common use case for VisiData now is for static data. I'm wondering if it does have any capacity for being able to process continuously updating information, such as the top command in Linux, or the network streams that you had built the original tool for?

Saul Pwanson: Yeah, absolutely. And in fact, I would love for VisiData to have more adapters or plugins or loaders for other things, like top. Actually, top's a great example. And I've got some prototypes of vgit, but vgit's still more static data. I do have a vping, which acts as a combination traceroute and ping; it updates live as it's finding the various hops and their latencies, and stuff like that. But then it turns out that every one of these applications is a pile of work. I've been devoting so much time to the core application of VisiData that I found myself not having a lot of time to polish the other ones to the degree that I want to. So I wish that I had that time, or that I could find somebody that wanted to work on it with me. But I've been focusing on the main end of it.

What are the main use cases for VisiData? [07:22]

Tobias Macey: I'm curious what the main use cases are for VisiData that you have found, both personally and within the community. In particular, which tools it has replaced in your toolbox for things like systems administration or data analysis, that you might reach for otherwise, but that VisiData is just more convenient for?

Saul Pwanson: Mm hmm. So it seems like one of the main use cases is to get a first look at downloaded data. I know that Jeremy Singer-Vine, who is the data editor at BuzzFeed, uses it all the time for exactly this purpose. Because you find data all the time online, and you don't know if it's useful, and you don't want to spend a lot of time investing in it — porting it into a database, for instance. You just want to see it right away, to see the first columns, the first rows, do a quick search or however else you want to view it. And to be able to get to that place instantly, as opposed to doing any amount of coding or work, is one of the huge benefits of VisiData. And so I would say that it's mainly super useful as an exploratory tool.

The other thing that I find it super useful for is getting data from one format and structure, into another format and structure. So you know, if you just need this one-off thing — you've got this pile of data here, whether it's an Excel spreadsheet, or even just piped in from another command, and you want to say, remove those four columns, add another computed column, and just save that off and pass it off, you can get all of that done in like 30 seconds. Whereas doing it in any other tool is going to take you at least several minutes to get those tools put together, then write the code, do whatever you're going to do, and it ends up taking more like a half an hour. And so it's a lot of those one-off scripts that I think VisiData has replaced for me.

Tobias Macey: Yeah, there are definitely utilities, both in the general user space libraries — particularly for Linux but also in Python — that can do direct conversion between formats. But the workflow you were describing, of being able to manipulate the information before you commit it to the other format, is definitely something that would typically require a lot more effort and exploration to make sure that you're getting things right. And then, once you do get it working, you're likely to use it repeatedly, but it's going to be much more static and brittle than if you were to use VisiData in a more exploratory fashion. And I know that VisiData also supports being able to build pipelines for that repeatable use case as well. So I'm wondering if you can just talk through an overall typical workflow for data exploration and analysis, and also the sort of conversion workflow that you might use VisiData with?

Saul Pwanson: Yeah, okay. So here's a typical workflow, for instance. I might download some data from the internet, for example from data.boston.gov — every city's got their own public datasets. And so you open it with vd to poke around and see what's there. I like to see the scope of the data and the precision: basically, how wide it gets and how many individual pieces you get, as well as how clean it is and how useful it is for my purposes. And then, as I'm browsing around, I start finding things — like, oh, this column seems interesting, I wonder what's in here. And because there might be a million rows in the data set, it's hard to see that, especially since it's often sorted in a certain way. And so I do a frequency analysis very often, using Shift+F, which is one of my favorite things about VisiData. So you just use Shift+F on any column, and within a couple of seconds if not instantly, you can see the top values in that column and the total number of values in that column. And I think it's really great for finding any anomalies or outliers in the data. So you can use VisiData as a quick sanity check: "oh, I see, there's really no data in that field, or seldom is there data in that field," or "oh, that's interesting, why are half the fields completely empty," et cetera. So that's one workflow.
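[The frequency analysis described above boils down to counting distinct values in a column and ranking them. Here is a rough sketch of what a frequency sheet computes, using only the standard library; this is illustrative, not VisiData's internals, and the sample rows are invented.]

```python
from collections import Counter

# Count distinct values in one column and rank them, so outliers and
# mostly-empty fields jump out immediately.
rows = [
    {"neighborhood": "Dorchester"},
    {"neighborhood": "Dorchester"},
    {"neighborhood": "Roxbury"},
    {"neighborhood": ""},  # empty cells show up as their own bucket
]
freq = Counter(r["neighborhood"] for r in rows)
print(freq.most_common())  # [('Dorchester', 2), ('Roxbury', 1), ('', 1)]
```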

What happens is, once I get the data into a state that I want, then it's pretty easy to revisit the command log in VisiData, cull it down to get only the stuff that I really want, and then save it off. And I can do that repeatedly, if that data might be updated, or I want to share it with somebody else. In fact, one of the things that's been interesting is how useful the command log has been for debugging. People can say, "this is what I've been doing with this, and this is the input data that I have, and here's the command log that I've been running over it." And it's interesting how I can look at the command log, and see, "oh, I wouldn't have known that you did that exact command here, or on that row." And that changes everything. That is the way that I can figure out what's going wrong. So the replay is a really useful debugging aid too.
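[The command log described above can be thought of as a table of actions that can be re-applied to the same input. This toy sketch shows the replay idea; the command name and log format here are invented for illustration and are not VisiData's real ones.]

```python
# Each user action becomes one log entry; replaying the entries against
# the same input data reproduces the final sheet.
COMMANDS = {
    "delete-col": lambda rows, col: [
        {k: v for k, v in r.items() if k != col} for r in rows
    ],
}

def apply_log(rows, log):
    """Replay a recorded list of (command, argument) pairs over the rows."""
    for cmd, arg in log:
        rows = COMMANDS[cmd](rows, arg)
    return rows

data = [{"name": "a", "junk": 1}, {"name": "b", "junk": 2}]
log = [("delete-col", "junk")]  # recorded during an interactive session
print(apply_log(data, log))  # [{'name': 'a'}, {'name': 'b'}]
```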

The command log [11:48]

Tobias Macey: I'm wondering if you can talk a bit more about the command log. As you said, it's definitely frequent that you run through an exploratory cycle, and you finally get to a good state, but you don't necessarily remember all the steps that you've run through. You may have deleted code or added new code, so you don't know exactly what the flow was. Whereas by using a more keyboard oriented tool, you can keep that history and see what the overall workflow was. So I'm curious if you can just discuss the command log and how it manifests in VisiData, and some of the benefits and drawbacks that it might provide.

Saul Pwanson: So, as far as I'm concerned, VisiData is a grand experiment. Originally I made it as just a browsing tool — a CSV browser, as I used to call it. And then it turns out that it's a lot more broadly useful; it turns out that spreadsheets and tables are a very universal construct. I was at the Recurse Center in early 2017, just playing around, and I started to wonder whether I could record all actions into a table itself. And I did that, and it didn't work at first, and I was recording all the motions and everything else, and it was a mess. But then once I took everything that didn't belong on the command log out, it worked remarkably well, and I think it's actually super handy. It's not as good, obviously, as a Python script. You can only have one input field, it's kind of a rigid structure. But given the limited amount of data that's on the command log, it's remarkably flexible and powerful. I'm very surprised about that.

There is one thing that I do wish that I could solve, and I'm sure that it is solvable, but I haven't managed it yet. The command log reports every action that you take, including all the dead ends that you wind up finding yourself in. Those are sometimes handy to have on there, so you can see what you did and didn't want to keep around. But when you're getting to the end, and you want to do it again, you want to cull all those dead ends out, and just get to the place that you currently are. And so I want to have some kind of graph or tree that would get you from your current state and only show the commands that you took to get there. That's what I really want to add to VisiData, and I just haven't gotten to that point yet.

Tobias Macey: And another thing that you mentioned, which I also noted in the documentation, is the case where you have a file, which might have millions of lines, which would typically be either difficult or impossible to open in a more traditional terminal-oriented editor, such as vim or Emacs or the less pager. And I'm curious what types of performance strategies and techniques you've used to be able to handle files like that, and particularly, being able to manipulate them without just exploding your memory usage.

Saul Pwanson: So one thing is that VisiData actually does explode memory usage. It doesn't explode as much as some other things, but in my mind, it actually does take up a lot of memory. But beyond that, though, I think there's a couple things that matter here. One is that there's an asyncthread decorator that you can add to any function in VisiData. And that means that when you call that function, it spawns a thread and goes off and does it asynchronously. And there's always one main interface thread that keeps active, that is constantly updating with its status. And then within that thread, you can add a little progress meter, and those are pretty easy to do. But the main thing is that because it is so easy to spawn additional threads, I do it all the time, whenever I'm doing anything that might take a while. And I know that any kind of linear operation on data, because you might have millions of rows, might take a while, so I want to make each operation spin off its own thread. And that keeps me conscious of how much time things are taking, for one thing.
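[The decorator pattern described above can be sketched in a few lines with only the standard library. This is a simplified illustration, not VisiData's actual implementation, which among other things also feeds the status line and progress meters.]

```python
import functools
import threading

def asyncthread(func):
    """Run the decorated function in a background thread when called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        t = threading.Thread(target=func, args=args, kwargs=kwargs, daemon=True)
        t.start()
        return t  # the caller can join() it, or just let it run
    return wrapper

@asyncthread
def slow_count(n, results):
    # Stand-in for a long linear pass over millions of rows.
    results.extend(range(n))

results = []
thread = slow_count(5, results)
thread.join()
print(results)  # [0, 1, 2, 3, 4]
```

The interface thread stays responsive because every potentially slow operation returns immediately, leaving the work running in the background.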

But I want to say that I actually don't think VisiData is fast, in itself. It's just responsive, and it turns out responsiveness matters more than speed. I would rather spend 10 seconds seeing a progress meter update and make it to the end, than five minutes with no progress and no knowledge of how long it's going to take at all. The first one is kind of soothing — I can take a break for 10 seconds — and the second one is very frustrating and makes me almost on edge. Like, do I need to do something here in order to kill it before it takes over my entire computer.

The other thing that's important is that tools like Excel or vim often want to own the data. They want to import it into their own structure and format, and that's what causes the thing to really blow up. Every cell has to be stored separately in its own custom thing. Whereas the key architecture thing that VisiData has that makes it so flexible is that it stores the rows natively. And so whatever I get from whatever Python library, that just becomes the row. Every item is just that object. And then columns are basically lenses into those rows. And so VisiData is computing the cells every time. It doesn't grab the cells and put them in, it just goes ahead and computes it whenever you want to see it. And adding a column is trivial: you can add a column in constant time. That also means that saving, for instance, is comparatively slow in VisiData, because it means it has to do the evaluation of all those columns for every single cell when it's saving. But it turns out, you don't actually have to save everything, usually — you're not actually looking at everything, only looking at the first hundred rows, or a few columns, or whatever. So I feel like between the threading and the conception of rows as objects and columns as lenses, that's what really helps VisiData stay focused on the user experience like that.
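[The "rows stay native, columns are lenses" idea above can be sketched like this. The class and data here are invented for illustration and are not VisiData's real classes.]

```python
# Rows stay whatever objects the loader produced; each column is just a
# function applied to a row on demand, so no cells are ever materialized.
class Column:
    def __init__(self, name, getter):
        self.name = name
        self.getter = getter  # computes a cell from a row, lazily

    def get_value(self, row):
        return self.getter(row)

rows = [{"city": "Boston", "pop": 675647}, {"city": "Seattle", "pop": 737015}]
cols = [
    Column("city", lambda r: r["city"]),
    Column("pop", lambda r: r["pop"]),
    # Adding a computed column is constant time: nothing is copied.
    Column("pop_k", lambda r: r["pop"] // 1000),
]
print([c.get_value(rows[0]) for c in cols])  # ['Boston', 675647, 675]
```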

How has VisiData evolved over time? [17:12]

Tobias Macey: Can you dig a bit more into the architecture and implementation of VisiData itself and some of the ways that it has evolved since you first started working on it? And I know that you've also got an upcoming 2.0 release, so maybe talk a bit about some of the evolution that's coming there, and then any libraries or language features that you have leaned particularly heavy on and found most useful in your work.

Saul Pwanson: Probably the biggest evolution is that, when I first was doing it, it actually started out in a single file script, I put everything into a single vdtui.py. And the idea was that it was a drop-in thing: you could copy it over to a server over SSH, and then you could start using it, and you wouldn't need anything but the base Python. And I licensed that as MIT. And then as I started adding more modules and internal plugins to it, I licensed those as GPL-3. I was trying to keep this very clean separation, because the idea was that you could use the vdtui, which was very similar to the thing I had at F5, for making all kinds of other apps. But then nobody really took me up on that, and it's kind of hard to use somebody's single file library like that unless it's a super tight little library. So I kind of gave up on that, and am now heading more towards a more plugin platform architecture, where VisiData the app is the thing that hosts the individual plugins that you can add. It may even wind up having a vgit application that you can use. But it's kind of turning on its head. As opposed to incorporating vdtui in some particular version of your thing, you're actually using the VisiData library.

Beyond that, it's now just a bigger open source project. It's got a whole packaging release cycle. And I'm working with Anja, who has been very instrumental in some of the packaging, testing, and documentation stuff that we've been doing. And it's just taken off and gained more traction in that sense. You also asked about the libraries that I like. One of the things that I'm doing to keep performance good is that I take very few dependencies. I feel like layers are how things get messy. So the fewer layers that you have, the better off it'll be, if you can wind up coding everything in between there. So as far as the libraries that I use, obviously curses is essential, but that's built into the Python standard library. The Python standard library is really fantastic, and everything's included, which is a super bonus. But then also, the PyPI ecosystem in general is so broad, that any format that I come across, HDF5 or Excel or whatever, they have a library already for it. And it's a library that you can use in Python, you write a page of code and you've got the stuff in there. And then all the loaders are just importing those libraries, then calling them, and putting the rows from the return value in there, with some columns around them. It's a nice, pretty simple concept.
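[A loader in this style can be as small as the sketch below: call the format library, keep each returned row as-is, and name the columns. This uses the standard library's csv module for illustration and is not VisiData's real loader API.]

```python
import csv
import io

def load_csv(text):
    """Parse CSV text into (column names, native row objects)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)          # rows stay the library's native objects
    columns = reader.fieldnames  # column names straight from the header
    return columns, rows

cols, rows = load_csv("a,b\n1,2\n3,4\n")
print(cols, rows)  # ['a', 'b'] [{'a': '1', 'b': '2'}, {'a': '3', 'b': '4'}]
```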

One thing you mentioned was that you wanted to know other ways that I keep VisiData fast. I've been very focused on making sure that it starts up very quickly. I feel like, if there's a half second of startup time, it just gets in the way and it feels like a certain kind of friction. And so one thing that I do a lot of is lazy importing. For all these libraries, I have no idea how long they're going to take to load or start up themselves. And I know that there are some pretty heavy ones that VisiData uses sometimes, but I don't use those unless I need to. When you open up an Excel file, for instance, that's the point where it imports the XLS library. And if you don't ever load an Excel file, then it doesn't have to spend the time doing that. So that's another one of those tricks.
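[The lazy-import trick described above is just an import statement moved inside a function, so the cost is paid on first use rather than at startup. This sketch uses the standard library's json module purely as a stand-in for a heavy format library.]

```python
def load_json(text):
    import json  # deferred: paid only on the first call, not at startup
    return json.loads(text)

# Until the first call, the module need not be loaded at all.
print(load_json('{"a": 1}'))  # {'a': 1}
```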

Before we move on, I wanted to mention the Python dateutil library. I don't take many dependencies, but that's one that I've been very happy to take, because it parses any date format that you can throw at it. It's amazing. It's a best-in-class detection and parsing tool. Also, another feature that I have used a lot is Python decorators. I think that's a pretty standard thing, actually, but I use them as a way of tagging functions. For instance, I mentioned the asyncthread decorator, which takes a pretty advanced concept and reduces it to just the essence, so that I don't have to think or work hard to have those concepts work for me.
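[Decorators-as-tags can be sketched like this: the decorator both marks the function (so tools and code can see the tag) and wraps its behavior. This is a generic standard-library example, not VisiData's code; the deprecated name is illustrative.]

```python
import functools
import warnings

def deprecated(func):
    """Tag a function as deprecated and warn whenever it is called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(f"{func.__name__} is deprecated", DeprecationWarning)
        return func(*args, **kwargs)
    wrapper.deprecated = True  # greppable, introspectable tag
    return wrapper

@deprecated
def old_api(x):
    return x * 2

print(old_api.deprecated, old_api(21))  # True 42 (plus a warning on stderr)
```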

Tobias Macey: Yeah, the decorator capability and the syntax that allows for it definitely make it a lot easier to organize code and concepts. You can just drop it on top of a function definition without having to try and incorporate it into the body of the function and remember, "what are the return types? What are the inputs?" You just say, "it wraps this function and then it handles it, I don't have to think about it anymore." And you can just keep it all in a utilities library, for instance.

Saul Pwanson: Yeah, exactly. Totally. I will say that one of the things that I wish decorators did support was being able to put the def, with the function name and signature, on the same line as the decorator. It's a minor point, but I use a lot of grep-type tools, and I would like to be able to know that the function I'm looking at is an asyncthread, or has a deprecated decorator now, and be able to see that when I'm searching for functions, for instance.

Tobias Macey: Well, one tip there is that if you add -B1 to grep, it'll show you the line that you want, as well as the line just before it.

Saul Pwanson: That's a great tip, thank you.

Why the terminal? [22:36]

Tobias Macey: Another thing that stands out with VisiData, particularly, is the fact that it is entirely terminal-oriented, whereas a lot of data exploration tools will be more focused on a graphical interface, or trying to embed into a Jupyter notebook for providing some sort of visualization. And I'm wondering, what was your motivation for focusing so closely on that terminal interface and making it a command line client?

Saul Pwanson: I feel like the terminal is my home. I've been using the terminal since the '80s, and so I'm very comfortable in that environment with those kinds of restrictions. I'm also much more comfortable with the keyboard than a mouse, but as I am getting older, it's becoming harder to just type verbosely. So I wanted individual keystrokes to do things, and I couldn't figure out how to do that in, for instance, a Jupyter notebook. The other thing is that because the terminal is so old, and mature, let's say, it's a universal interface. Any platform has an SSH client and a terminal. The only other way that you can get that level of universality is with a web browser and, for instance, an Electron client, and that's way too heavy. So I feel like the choice is between a terminal which is very light, and an Electron client which is very heavy. That choice for me is obvious. I want it to be a very quick in-and-out thing. I don't know how I would do the same kinds of things as quickly in the web, or even in a native app, if you have to reach for the mouse. You can do it with a given native app for sure. But then you have to make a native app for Mac, for Windows, for Linux, et cetera, and I just didn't want to do that. And so really, VisiData is only about 10,000 lines of code all told, which is actually quite a bit, but it's not that much when you compare it to comparable tools like OpenRefine. And I think that's a testament to being on the command line. I can do the minimum necessary to get the job done, and don't have to worry about things like pixel width. It's like, "No, you choose your own font, you choose your own font size and ways to interact with the thing; I'm just here to get out of the way."

Tobias Macey: It also has the benefit that, as you were describing originally, you can just copy it over via SSH, or now pip install it on a remote machine. You don't have to go through the hoops of trying to set up a way to have a graphical interface to that remote box, you can just copy it over. And it broadens the reach and capabilities and use cases for the tool, where the only access you have is via terminal, which as you said, despite its age is still a fairly common experience for people who are working in technology.

Saul Pwanson: Yeah, absolutely. I work remotely now, and we do have screen-sharing apps, but they don't always work that great, and you sometimes have to install some other plugin. The app that I love most is called tmate — it's a tmux wrapper, I guess. And you just install it, and you type tmate, and you get an instant shell into your own machine and you can give somebody an SSH link. And they can do that, and people are usually amazed. Between that and VisiData, now we have an instant data exploration platform we can share, alongside a chat client or whatever. That's it. And I find that to be so much more accessible than modern video chat platforms — even us, at the start of our session here, had technology difficulties. The shell is very reliable, by comparison.

Tobias Macey: By the fact that you are targeting the terminal environment, what are some of the constraints that that brings with it, in terms of your capabilities that you can bring into VisiData? What are some of the most challenging aspects of trying to build a user interface for data exploration and analysis, within this environment that is so graphically constrained, particularly given the fact that you have incorporated some visualization capabilities, and just some of the ways that that manifests?

Saul Pwanson: I have to say that I've been pleasantly surprised that I haven't felt as constrained as I thought I would. For example, once I discovered that I could use Braille characters to do the graphing, it kind of just worked. It's not perfect, but if you want more perfect things, you should be using other tools. And in fact, that's one of the things I think that is important with VisiData, is that it's not meant to be a be-all and end-all. It's kind of a glue technology, right? It's an adapter. And so, once you figure out what you want to do, then you should go to the fancier tool and do it right. But there's no reason to do everything super right from the get go. You just want a quick glance at it.
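[The Braille-graphing trick mentioned above works because each character in the Unicode Braille Patterns block (starting at U+2800) is a 2x4 grid of dots, so one terminal cell can encode eight "pixels". This sketch shows the encoding only; it is illustrative, not VisiData's plotting code.]

```python
# Map each (x, y) dot position in the 2-wide, 4-tall cell to its bit in the
# Braille pattern encoding (dots 1-8 per the Unicode standard).
DOT_BITS = {(0, 0): 0, (0, 1): 1, (0, 2): 2, (1, 0): 3,
            (1, 1): 4, (1, 2): 5, (0, 3): 6, (1, 3): 7}

def braille_cell(pixels):
    """pixels: set of (x, y) with x in 0..1 and y in 0..3 that are 'on'."""
    bits = 0
    for p in pixels:
        bits |= 1 << DOT_BITS[p]
    return chr(0x2800 + bits)

# Left column of dots fully on, right column off:
print(braille_cell({(0, 0), (0, 1), (0, 2), (0, 3)}))  # '⡇'
```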

You asked about the most challenging aspects of building a terminal UI. And I have to say, the thing that's been most challenging for me is knowing that if VisiData were on the web, it would probably be worth a lot of money, and doing it in the terminal means that I've kind of eschewed that. No one really pays for terminal tools. I shouldn't say no one — I actually have several Patreon subscribers, and I'm really appreciative of them.

But you know, if you think about how VisiCalc back in the day was a fancy program that sold for hundreds of dollars, I can't imagine doing that with VisiData. That's just not how the world works. Although, if it was a native app, that might make it possible to sell VisiData, for instance. The other thing that is more challenging than you might think, is that you mentioned pip installing something. And that's really great for people who already have Python and already know how to use pip. But I actually think VisiData is a pretty reasonable tool to use for anybody — anybody who's willing to use the keyboard anyway. And yet, installation is one of the trickier parts, right? If I wanted my partner to go and install it on her computer, I have to tell her "install this, click here, download this." People just want a single thing they can download, they can double click on, and then go. And that's just not how the terminal world generally works. So I feel like installation is one of those weird things, where there's a larger barrier to entry than there should be, and yet I can't find a super easy way around that. So that's just how it has to be.

Tobias Macey: I'm also wondering which terminal environments it supports. Because Windows is generally one that's left out of the support matrix for a command line tool, but because of the fact that Python does have the curses interface built in, or if you are relying on the prompt_toolkit library, I know that there is the possibility of being able to support the Windows command line, I'm just curious what the overall support matrix is for VisiData.

Saul Pwanson: I started off doing it in Linux, because that's what I run, and then it turns out that it works on a Mac terminal just fine. And I didn't have access to a Windows machine, and so I was like, well, Windows isn't fully supported. It turned out that people were just running it under Windows Subsystem for Linux — WSL, I think is what it's called? And it basically just worked there. Then somebody submitted a small patch, and it works even without WSL now, I think, on a more recent version of Windows. So I actually have never used it on Windows, but I know we've got quite a few people using it there. Some Italian open data users, like Andrea Borruso, love using it on Windows, and it works fine. So as far as I'm concerned, it works on Mac, Windows, and Linux and seems to be fine on all of those. So I wouldn't say those platforms are what we support, necessarily. But if it works, I'm not gonna say we don't support it either, right?

The design of keybindings [29:54]

Tobias Macey: [chuckles] Another peculiarity of building a command line oriented client is that it is heavily keyboard-driven, as you mentioned, and that means that you need to create the set of key bindings that will do whatever it is you want to do. So I'm wondering how you've approached the overall design and selection of those key bindings, to ensure that there is internal consistency and that the key bindings make semantic sense, but also so that you don't run out of key bindings in the event that there's some new capability you want to add, because it is a limited space, even when you do incorporate modifier keys.

Saul Pwanson: Yeah, totally. We're running up against that now. There are few keys left, in some sense — at least, few keys that people want to use. So the main thing is that I have to use mnemonics to make sure that key bindings stick in my own memory. I actually have a pretty bad memory myself, and so if they don't fit my mental model, I can't remember them from one month to the next. So I've made sure that they at least make sense to me. And because I've been using terminal stuff for so long, I'm already tuned into the existing text culture that is around. So a lot of the key bindings are borrowed from vim, like d for delete, a for add, and so on.

Actually, when I showed it to Ingy döt Net, he chastised me for the fact that Ctrl+Z didn't suspend in VisiData, and I was like, "Oh, you are totally right, that needs to be fixed right away." And I did fix it right away, because you want to make sure that the things that people are used to will still work. And so now Ctrl+Z will just work. The other thing I think is really important is that there are layers of mnemonics. We have a couple of modifier keys, and I wanted to keep those very simple. Like, I think that vim is great, but it feels daunting when you see all the different possible combinations.

And so I hope that when you see that VisiData has exactly two prefix keys, it feels like "okay, maybe we can wrap our heads around this." And those are just modifiers on other existing commands. And so there's layers. Another piece of this is that column type conversions, for instance, are all on the symbol keys. And in fact, all the types are on a single row. And so you've got date on @, and converting to an int on the # sign. Those are all adjacent to each other on the top row of the keyboard, and then other column things are also symbols.

And so you don't get confused thinking well, "Isn't an int i?" "– No, it's one of those ones up there." "Which one is it? Oh, I think it might be the one that looks like that." That helps me, anyway. And then everything that's for going to a new sheet is all on the Shift key. For me, it's like shift and sheet almost rhyme. So Shift+F goes to the frequency sheet, stuff like that.
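[Editor's note: the symbol-row type keys Saul describes might be sketched roughly like this in plain Python. The @ and # bindings are the ones he names; the other entry, and all the function names, are illustrative guesses rather than VisiData's actual code.]

```python
from datetime import date

def parse_date(s):
    """Simplified stand-in: parse an ISO yyyy-mm-dd string into a date."""
    y, m, d = (int(part) for part in s.split("-"))
    return date(y, m, d)

# Hypothetical key-to-type table: each top-row symbol key swaps in a
# converter function for the current column.
TYPE_KEYS = {
    "#": int,          # '#' converts the column to integers (per the interview)
    "@": parse_date,   # '@' converts it to dates (per the interview)
    "%": float,        # illustrative: another symbol key for floats
}

def retype_column(values, key):
    """Apply the converter bound to `key` to every cell in a column."""
    convert = TYPE_KEYS[key]
    return [convert(v) for v in values]
```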

And then finally there's symmetry; I think that's really important. So I reserved all of the pairs. For instance, open paren (, close paren ), open brace {, close brace }, those were all reserved from the get-go, for things that have both a front and a back. So sorting is on the square brackets, and one way goes ascending and the other way goes descending. Scrolling down to the next item is the greater-than sign; the previous item, the less-than sign. And I feel like symmetry on those things is very useful, but then also, more broadly, symmetry between commands. So the g prefix goes bigger, and the z prefix is smaller, more precise. And so when you say "I want to delete all the rows I've got selected," that's gd. "I want to unselect all rows," that's gu. To me, in a way, that's what makes sense. You may not know it before you discover it, but once you discover it, then it's sticky. I feel like that's a super important thing. So the VisiData interface isn't made for the first-time user, it's made for, like, the third-time user.
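[Editor's note: the g/z prefix layering can be pictured as a lookup table keyed by the full keystroke sequence. This is an illustrative sketch, not VisiData's internals.]

```python
# Each entry maps a (possibly prefixed) keystroke sequence to an action
# description; the 'g' prefix broadens the scope of the same base command.
COMMANDS = {
    "d":  "delete current row",
    "gd": "delete all selected rows",
    "u":  "unselect current row",
    "gu": "unselect all rows",
}

def dispatch(keys):
    """Look up the action for a keystroke sequence."""
    return COMMANDS.get(keys, "unknown command")
```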

Tobias Macey: Yeah, there are definitely a lot of peculiarities, and a sort of culture and history built up around different key bindings. And as you mentioned, vim has its own set, Emacs has a different set of key bindings that people will be familiar with, and then there are any number of command line tools that have all created this sort of general pattern of how you would craft these key bindings. So it's definitely interesting to hear some of the history of how you have approached it, because of your particular toolset choices. And like you said, anybody who's been living on the terminal long enough will find it fairly natural. I appreciate the care that you've put into considering how you add new key bindings, so that it doesn't just end up cluttered, and so that you can have some sort of mnemonic muscle memory for recreating a certain workflow once you pick it up. Because with any tool, there will be periods where you put it down for a while and don't come back to it. And then when you do come back, you want to be able to get right back to being productive without having to remember what all the key commands were, or look at the reference manual.

Saul Pwanson: Yeah, totally. The other thing that we did was that a year ago or so we instituted longnames. So originally the key bindings were all we had; for example, Shift+F meant the frequency table, and that was the only way you could access it. But now there's actually a longname for that, I think it's open-frequency-table. I'm not sure that's exactly right, don't quote me on that. But then you bind Shift+F to that, so people can rearrange their keyboard if they really want to, but it also gives them the ability to add commands that aren't bound to any key. For instance, if you make your own command in your ~/.visidatarc, you can create it and bind it yourself. Or take a command that's rarely used — for instance, random rows. That used to be on Shift+R, but then we made Shift+R be redo, in the undo-redo pair, because that gives a kind of symmetry between Shift+U and Shift+R, as opposed to vim's lowercase u and Ctrl+R, which doesn't make natural sense, to me anyway. Anyway, I moved random rows off of Shift+R, but then there was no real good place for the random selection of rows to go. So since I feel like selecting random rows is an infrequent operation, I put it on a longname only, and you can just press the spacebar, enter the longname, and it goes ahead and does it. So that's another tactic: start moving things off of the default key bindings, into a huge list of possible commands that you could use.
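[Editor's note: the longname indirection Saul describes — keys bind to longnames, longnames map to commands, and a command can exist with no key at all — might look roughly like this. The function and command names here are hypothetical, not VisiData's actual API.]

```python
commands = {}   # longname -> callable
bindings = {}   # keystroke -> longname

def add_command(longname, func):
    commands[longname] = func

def bind_key(key, longname):
    bindings[key] = longname

def press(key):
    """Run the command whose longname is bound to this key."""
    return commands[bindings[key]]()

def execute(longname):
    """Run a command directly by longname, even if no key is bound to it."""
    return commands[longname]()

# A user's ~/.visidatarc could then add or rebind commands:
add_command("open-frequency-table", lambda: "frequency table opened")
bind_key("F", "open-frequency-table")
add_command("select-random-rows", lambda: "random rows selected")  # unbound
```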

Tobias Macey: Yeah, that's another thing that's got a long tradition, both in vim and Emacs and other tools — being able to have a way of opening a prompt, so that you can then start typing, as you said, a long-named command, or being able to start typing it and then maybe tab through to cycle through what the commands are, so that you don't have to necessarily remember all those off the top of your head as well.

Saul Pwanson: Mm-hmm, totally. We do have tab completion, and it works anywhere that we have an open prompt, including the longname prompt.
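[Editor's note: tab completion over longnames reduces to a prefix filter, something like this sketch.]

```python
def complete(prefix, longnames):
    """Return the longnames that could complete what's been typed so far."""
    return sorted(name for name in longnames if name.startswith(prefix))
```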

VisiData's capabilities [36:37]

Tobias Macey: Then, in terms of the types of analysis that VisiData can do out of the box, I know that you mentioned frequency analysis or histograms. But I'm curious, what are some of the other capabilities that come natively in VisiData, and any of the interesting plugins that you or others have contributed for being able to expand the capabilities and utility of VisiData?

Saul Pwanson: So, out of the box, it can do all kinds of interesting stuff: searching and filtering, bulk editing and cleaning, spot checking, finding outliers. I use it, actually more often than I would expect, for file format conversion. The ability to load any format and then save it to JSON or character-separated values or Markdown is super handy, and it gets me from here to there a lot faster than I could otherwise. Even scraping a web page for its tables is basically built in. Jeremy Singer-Vine, for instance, has written several plugins already for the current version. He wrote one that does row duplication, and a loader for the FEC, the Federal Election Commission, dataset. You just download those, import them in your ~/.visidatarc, and they're ready for action right away.
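[Editor's note: the load-one-format, save-another workflow can be illustrated in plain Python for the CSV-to-JSON case; VisiData does this interactively, across many more formats.]

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Read CSV text and re-serialize its rows as a JSON array of objects."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)
```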

Tobias Macey: And when I was looking at the documentation, it seems that one of the libraries that you can load into it is pandas, and I'm wondering if that means you can expose all of the pandas capabilities as well, as you're exploring these data sets. Because I know that's often the tool people will reach for their first time digging into a data set, just to see what's the shape of it. And so I'm curious how that fits into the overall use case of VisiData as the exploratory tool, and where the boundaries are, when you might want to jump to pandas, or whether you could just incorporate that whole flow together.

Saul Pwanson: Yeah, that's interesting. I made a very simple adapter for pandas, it literally was maybe 20 lines of code at first, just because pandas supports a lot of different loaders too, and it's super handy to be able to use and browse those. But what's interesting is that pandas and VisiData don't actually play that well together. To do things like sorting, for instance, VisiData grabs each value and sorts based on that, but pandas' built-in sort function does it more efficiently, and there's just no good way to switch over to it automatically. You have to write all the commands in a way that's compatible with pandas for pandas sheets. And that's totally doable, but it's a fair amount of work, and I haven't done it. Somebody did make some modifications to make pandas more responsive in certain cases, to make things work better. And that's totally doable. Like I said, it takes a fair amount of work, and it doesn't happen naturally. You can't just use pandas the way you'd expect. You can use some of pandas' functions on pandas sheets, and even on non-pandas sheets if they're standalone functions, but if you want to use a pandas DataFrame naturally, like you would elsewhere, you're probably better off using it in Jupyter by itself.
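[Editor's note: the sorting mismatch Saul describes can be illustrated with a hypothetical column object that counts per-cell fetches; a pandas-backed sheet would instead hand the whole operation to df.sort_values. The class and method names are illustrative, not VisiData's API.]

```python
class Column:
    """A column that counts how many individual cell fetches it serves."""
    def __init__(self, name):
        self.name = name
        self.fetches = 0

    def getValue(self, row):
        self.fetches += 1
        return row[self.name]

def generic_sort(rows, col):
    # The generic path: one getValue call per cell, sorted in Python.
    # A pandas adapter would override the sort command to call
    # df.sort_values(col.name), sorting the whole column natively instead.
    return sorted(rows, key=col.getValue)
```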

The community [39:40]

Tobias Macey: In terms of the overall growth and adoption of VisiData, it seems that there's a decent community that's grown up around it. And I'm wondering how you approach the project governance and sustainability as a solo developer, and how you are looking to grow the community and incorporate more people into the future of VisiData.

Saul Pwanson: Yeah. Well, you're saying I'm a solo developer, but I've got a little bit of help now. Like I said, Anja has been instrumental in helping me with decisions and discussions, stuff like that. There's also a #visidata channel on freenode that several of us hang out in, where people can talk about things and ask questions. I personally prefer a chat system like that; I've been on chat platforms for over 20 years now, and I find myself doing a lot better with chat than I do with email. Email is a lot heavier, it requires more intention and attention, whereas with chat I can just toss up an answer and it's done.

So I'm, of course, the decider on those things. But I have to be honest, it kind of feels like I'm discovering VisiData more than creating it at this point. It's like a chunk of marble to a sculptor; it tells me what it wants to become. And there are some things that I didn't even consider, and then I look at them like, why didn't I think of that already? For example, the rowtype down in the lower left corner, where it shows you lines or columns or whatever the current data type is. For the longest time, almost to 1.0, that just said "rows". And I didn't know why I even put the text there, if all it was going to do was say the same thing every time. And yet I felt strongly that it should be there. And then once I realized that it should just be the rowtype, I was like, oh! And I don't feel like that was my creation. That's just how it had to be, if that makes sense.

So there's that, and then you mentioned project sustainability. The thing is, my energy is my most precious resource — my energy and ability to code. I have a day job, and so I come home at night, and it's then that I want to screw around with VisiData. But it's really hard to summon the energy when I don't have a concrete use case, or someone who really cares about something. So I have the most energy when somebody is around and is enthusiastic, and they have a sample data set, and they're like, I just want to do this thing to it. It becomes a little puzzle you can put together. Can I use an existing command to do this? Is there a one-liner to put in their ~/.visidatarc? Or does this require a new core piece of functionality, so that not just this case but ten other cases can be solved too? Those are the things I enjoy the most; I actually do really enjoy solving those puzzles. But then sometimes we'll have people who ask for a generic feature, and it doesn't feel very immediate, it's more abstract. Or if I have a concept for something that I've been wanting for a while, and nobody really, really wants it, I get less motivated, and I just kind of decide to do something else.

Innovative uses of VisiData [42:51]

Tobias Macey: So what are some of the most interesting or unexpected or innovative ways that you've seen VisiData used?

Saul Pwanson: I feel like we have a couple of superfans, people who will use it for well more than they really should. One of them is a guy named Christian Warden, and he does a lot of Salesforce consulting and stuff. So he's got buckets of data and just wants to move through it quickly. He built a duplicate row finder for some dataset with Python expressions and a .vd script. VisiData is not made for inter-row computation; you can have expressions that compute within a row, no problem, but if you want to look at the previous row, it's not really meant for that. I mean, I'd like to add that at some point, but I haven't figured out a really great way to do it yet. But he figured out how to pull it off, and it was an amazing beast, and it worked. It actually exposed a bug in the computation that made it take forever to run. But once he fixed that, it actually was remarkable. Like, wow, you really have turned this into yet another Turing-complete programming environment. [laughs] So that has been kinda weird.

Also, I'm not sure if you or your listeners have seen the lightning talk that I gave a couple years ago, but I had some data that had lat-long coordinates, and I was just curious if I could plot those in my little canvas. And it turns out that plotting latitude-longitude as x,y coordinates works really well for maps, even for, like, a million points. There you go, you can see the distribution of things. It was surprising to me that it worked as well as it did, to be honest. Like, I don't think of this as built for geographic information at all, and yet you can kind of pull it off. So that's been both surprising and unexpected and – yeah, kinda pleasing, too.
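[Editor's note: the lat-long trick amounts to scaling longitude to x and latitude to y over a character grid. A minimal sketch, with made-up canvas dimensions:]

```python
def plot_points(points, width=72, height=20):
    """Render (lat, lon) pairs as a crude ASCII scatter plot."""
    lats = [lat for lat, lon in points]
    lons = [lon for lat, lon in points]
    min_lat, max_lat = min(lats), max(lats)
    min_lon, max_lon = min(lons), max(lons)
    grid = [[" "] * width for _ in range(height)]
    for lat, lon in points:
        x = int((lon - min_lon) / ((max_lon - min_lon) or 1) * (width - 1))
        # Screen rows grow downward, so flip latitude.
        y = int((max_lat - lat) / ((max_lat - min_lat) or 1) * (height - 1))
        grid[y][x] = "*"
    return "\n".join("".join(row) for row in grid)
```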

Tobias Macey: Yeah, I did see that lightning talk. And that was one of the things that I was kind of blown away by, as far as the visualization aspect of VisiData, given that it's a terminal environment. And so it's interesting to hear how you just mapped the lat-long to x,y coordinates. And I'm sure that you just figured out the maximum bounds of the coordinates, to work out how the plot coordinates needed to relate to each other. So, that's pretty funny.

Saul Pwanson: Yeah, thank you. One more thing, if you don't mind, that has been kind of surprising to me is how meta the thing goes. So editing VisiData's internals, using VisiData's own commands, is something that's been kind of surprising for me. Just yesterday, a user asked how they could get a type column on the Describe Sheet. And I thought about it and I was like, you know what you can do, is: you can go to the Columns sheet, and you can copy it from there, and then you can paste it onto the Columns sheet of the Describe Sheet. And it'll just work. And it's like, you couldn't possibly do that with Excel. Similarly, if you've got 1000 columns, and you want to search or select all the ones that begin with a certain thing, and remove all those from the set, you can do that in VisiData. And that's no problem, it just works just like anything else. I have no idea how you do that in almost any other tool. And so I feel like the metadata editing aspect of it has been surprising for me, even though I put it in there, but the fact that it works as well as it does has been really kind of interesting.
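[Editor's note: treating column metadata as ordinary rows means bulk column operations reduce to row selection. A hypothetical sketch of the "remove all columns beginning with a certain thing" case:]

```python
import re

def drop_columns(column_names, pattern):
    """Keep only the columns whose names don't match `pattern` at the start."""
    return [name for name in column_names if not re.match(pattern, name)]
```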

The future [46:08]

Tobias Macey: And looking forward, what are some of the features or improvements that you have planned for the future of VisiData?

Saul Pwanson: So right now we're working on the 2.0 release, which is still a couple of months out, and the goal is to stabilize it. The current version, 1.5.2, is actually incredibly stable, and really tight; there's been at least one bug I've seen, but that's fine, it was an edge case. But the API is all over the place, it's not as coherent as the user interface. And so one of the things we want to do is make the API stable and produce some more coherent documentation about the internals; we're calling it the "Book of VisiData." And the point of that is so that we can let it rest and work on some other things, while letting other people go wild and share their own creations — plugins, commands, loaders, or whatever — without destabilizing the core goodness that is there. So I'm sure there will be a 2.1 or whatever after that. But I'm really hoping that after 2.0, development can slow down, and I can move on to some other projects that I have in the queue.

So one of the things that we've been talking about that's gotten a bit of traction is something I've been calling "Where in the Data is Carmen San Mateo?", which is a throwback to an old game from the 80s and 90s, which maybe you've heard of, called "Where in the World is Carmen San Diego?"

Tobias Macey: I used to play it, yep.

Saul Pwanson: Okay, yeah, so you're familiar. So the idea is that I want that kind of game, but with data and data sets. So it'll be for hardcore data nerds, kind of an escape-the-room game, or a choose-your-own-adventure kind of game, where you're solving a crime, but you get data sets to look at, and that's how you get the actual clues and solve the puzzles. I'd like to work on that, and so that's my next project, probably, in the queue. But I don't really want to do that until we've got VisiData 2.0 locked down, and we feel like it's a stable place for everybody.

The terminal renaissance [48:09]

Tobias Macey: Are there any other aspects of VisiData or data exploration or any of the other accompanying projects that we didn't discuss yet, that you'd like to cover before we close out the show?

Saul Pwanson: Not specifically, although I did want to say that I feel like we're in an age of kind of a terminal renaissance. We went through the period of the late '90s and early 2000s where it was more graphics, all the time, and that was the obvious way up and out. But the terminal has been with us throughout, and I definitely have never left it. I feel like within the past maybe 10 years or so, with projects like ohmyzsh and tmux/tmate and many others, the terminal has been seeing a resurgence. And even now, when people go to data science boot camps and stuff, they have to learn the terminal and get involved there too, because you need to be able to do that in order to dive deeply into data stuff. And so I feel like this is part of that. VisiData is saying: no, wait a second, you don't have to be in the web graphics world in order to do high-quality work. In fact, not being in that world makes it a lot easier for you, if you can just embrace the fact that you're going to be at the terminal and using a keyboard.

Tobias Macey: Yeah, I definitely appreciate that there is a lot more focus being paid to making things that work in the command line and being able to stitch them together. Because, as you said, while graphical interfaces are appealing, and it's easier to sell something to somebody who isn't as technical if you're in that environment, they bring a lot of extra weight and requirements to their development and maintenance, as well as, in some cases, to their use, because they are much more mouse-driven. And that makes it harder to have a unified flow.

Saul Pwanson: Yeah. And you know, to be honest, the versions of iOS and Windows are going to keep marching forward. And I have no doubt that if I made an app for the current version of either of those, then within the next three or four years, it wouldn't work with the next version. I am actually pretty confident that if I don't touch VisiData for the next four years, you'll still be able to use it in the next version of Python, on whatever platform, no problem at all. And I find that really motivating to do a good job now, because I don't have to keep rewriting it; I just have to do it well once.

Closing [50:34]

Tobias Macey: For anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And with that, I'm going to move us into the picks, and this week I'm going to choose a newsletter that I actually found while I was doing some research for this conversation, called Data is Plural. It's maintained by Jeremy Singer-Vine, who you mentioned a few times. It's a weekly newsletter featuring different interesting data sets, and it looks to have some fairly curious discoveries. So if you're looking for something to experiment with in VisiData, you might have some interesting finds in there. With that, I'll pass it to you, Saul, do you have any picks this week?

Saul Pwanson: I wanted to promote tmate, which I think I have already during this episode. Definitely give tmate a look if you're a terminal user and want to have a multi-user experience. There are a lot of other tools in that same vein; I also want to give a shoutout to mosh, the mobile shell, for less-than-perfect network connections. And – yeah, there's all kinds of good tools out there, but I'm not sure I can come up with any more off the top of my head.

Tobias Macey: All right, well, thank you very much for taking the time today to join me and share the work that you've been doing with VisiData. It's definitely something that I'm going to be experimenting with, because I spend a fair amount of my time on the terminal and have to do a lot of exploration of random data sets, whether it's just CSVs or piping things from different batch commands. So thank you for all of your work on that tool, and I hope you enjoy the rest of your day!

Saul Pwanson: Awesome, thank you very much! I hope you have a great day too.

Outro [52:06]

[Outro music begins.]
Tobias Macey: Thank you for listening. Don't forget to check out our other show, the Data Engineering Podcast, at dataengineeringpodcast.com, for the latest on modern data management. And visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something, or tried out a project from the show, then tell us about it. Email host@podcastinit.com with your story. To help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[Outro music ends.]
