Associate Editor: Karim Ali (@karimhamdanali)
Since 2010 I've been working on Python Tutor, an educational tool that helps novices overcome a fundamental barrier to learning programming: understanding what happens as the computer runs each line of code. Python Tutor allows anyone to write code in their web browser, see it visualized step by step, and get live help from volunteers. Despite its now-outdated name, this tool supports seven languages: Python, JavaScript, TypeScript, Ruby, Java, C, and C++. So far, over 3.5 million people in over 180 countries have used it to visualize over 50 million pieces of code. You can find research papers related to this project on my publications webpage. But in this blog post, I want to dive into some implementation details that I haven't gotten to highlight in my papers.
Let's start with a simple Python example (run it live here):
This code creates instances of three basic data structures: an ordered collection (called a list in Python), a key-value mapping (called a dict or dictionary in Python), and an unordered set. Note how elements within these data structures can point to other data structures; for instance, the second element of the top list (accessible via the global variable x) points to the bottom list. Using Python Tutor, novices can easily see pointer and aliasing relationships by following the arrows in these diagrams. Without this tool, they would need to print out serialized string values to the terminal, which obscures these critical details.
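As a minimal sketch of code with this shape (the names other than x and all of the values are made up for illustration), the snippet below builds a list, a dict, and a set in which two slots refer to the same inner list. It also shows why printing hides the aliasing that the arrows make visible.

```python
# Hypothetical values; only the shape matters: a list, a dict, a set,
# and one inner list that is reachable from two places.
inner = [1, 2, 3]
x = ["hello", inner, {"name": "Ada", "scores": inner}]
s = {10, 20, 30}

# Printing serializes values, so the shared list looks like two separate lists.
printed = repr(x)

# An identity check reveals what the arrows in the diagram would show.
shared = x[1] is x[2]["scores"]

# Mutating through one reference is visible through the other.
x[1].append(4)
after_mutation = x[2]["scores"]
```

In the printed string the inner list appears twice with no hint that both occurrences are one object; only the identity check or the later mutation exposes the sharing.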
How is Python Tutor implemented? By hooking into Python's built-in debugger protocol (bdb in its standard library). This tool runs the user's inputted code, single-steps through execution one line at a time, and traverses the object graph starting from globals and stack-local variables. It records a full trace of the stack and heap state at all execution steps and then sends the trace to the web frontend to render as interactive diagrams.
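To make the idea concrete, here is a heavily simplified sketch of line-level tracing with bdb. It is not Python Tutor's actual code: the class name TraceRecorder, the helper record_trace, and the "<user_code>" filename are assumptions for illustration, and it records only a repr of each global rather than traversing the full object graph.

```python
import bdb


class TraceRecorder(bdb.Bdb):
    """Hypothetical recorder: one snapshot of the globals per executed line."""

    def __init__(self, filename="<user_code>"):
        super().__init__()
        self.filename = filename
        self.trace = []

    def user_line(self, frame):
        # bdb calls this hook just before each line runs while stepping.
        if frame.f_code.co_filename != self.filename:
            return  # ignore frames that do not belong to the user's code
        snapshot = {
            name: repr(value)
            for name, value in frame.f_globals.items()
            if not name.startswith("__")
        }
        self.trace.append({"line": frame.f_lineno, "globals": snapshot})


def record_trace(source):
    """Run the source under the recorder and return the list of snapshots."""
    recorder = TraceRecorder()
    code = compile(source, recorder.filename, "exec")
    recorder.run(code, {"__name__": "__main__"})
    return recorder.trace


trace = record_trace("a = 1\nb = a + 1\n")
```

Each entry describes the state just before the given line executes, so a real tool would also need a final snapshot (for example via the user_return hook) and a traversal of locals and heap objects.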
The main limitation of this "trace-everything" approach is scalability: it's clearly not suitable for code which runs for millions of steps or creates millions of objects. But code written by instructors and students in educational settings is usually small -- running for dozens of steps and creating around a dozen data structures -- so this simple approach works well in practice.
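One common way to keep a trace-everything tool bounded is a cap on executed steps, in the spirit of the 1,000-step cutoff mentioned at the end of this post. The sketch below uses sys.settrace directly; the function name, exception name, and message wording are assumptions, not Python Tutor's implementation.

```python
import sys


class StepLimitExceeded(Exception):
    """Hypothetical signal that the user's code ran for too many steps."""


def run_with_step_limit(source, max_steps=1000):
    steps = 0

    def tracer(frame, event, arg):
        nonlocal steps
        if frame.f_code.co_filename != "<user_code>":
            return None  # do not trace frames outside the user's code
        if event == "line":
            steps += 1
            if steps > max_steps:
                raise StepLimitExceeded(
                    f"Stopped after {max_steps} steps; try shortening your code."
                )
        return tracer

    code = compile(source, "<user_code>", "exec")
    sys.settrace(tracer)
    try:
        exec(code, {"__name__": "__main__"})
        return {"status": "ok", "steps": steps}
    except StepLimitExceeded as exc:
        return {"status": "stopped", "steps": max_steps, "message": str(exc)}
    finally:
        sys.settrace(None)


short_run = run_with_step_limit("a = 1\nb = 2\n")
runaway = run_with_step_limit("n = 0\nwhile True:\n    n += 1\n", max_steps=50)
```

This simple version can be defeated by user code that swallows every exception, so a production tool would likely pair it with a process-level timeout.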
Now here's the same code example ported to JavaScript (run it live here):
This heap object diagram looks exactly the same as the Python one, albeit with different labels: in JavaScript, an ordered collection is called an array, and a key-value mapping is called an object (note that there's also a Map type). The JavaScript implementation works in the same way as the Python one: by hooking into the debugger protocol of the Node.js JavaScript runtime.
Here's what this example looks like in Ruby, once again implemented by hooking into the interpreter's built-in debugger protocol (run it live here):
These three identical-looking examples show how the diagrams generated by Python Tutor are designed to be fairly language-independent. Novice programmers need to learn about concepts such as stack frames, scope, control flow, primitive data types, collections, and pointers. To facilitate this learning, Python Tutor implements a graphical abstraction layer that takes the details of each language's low-level trace data and turns them into higher-level diagrams that capture the essence of the associated programming concepts. This abstraction makes it straightforward to expand the tool to work on additional languages as demand arises. It also makes it possible to scaffold learning of one language when someone already knows another one, such as teaching Python programmers how to quickly get up to speed on JavaScript.
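A rough way to picture such an abstraction layer: the heap is described in language-neutral terms, and only the labels are looked up per language. The table and function below are hypothetical, meant only to illustrate the idea rather than the tool's real data format; the Python and JavaScript labels follow the names used earlier in this post, apart from the JavaScript set label, which is assumed.

```python
# Hypothetical label table: same abstract kind, language-specific label.
LABELS = {
    "python": {"sequence": "list", "mapping": "dict", "set": "set"},
    "javascript": {"sequence": "array", "mapping": "object", "set": "Set"},
}


def render_heap(language, heap):
    """Turn a language-neutral heap description into labeled diagram rows."""
    labels = LABELS[language]
    rows = []
    for obj_id, (kind, contents) in sorted(heap.items()):
        rows.append(f"#{obj_id} {labels[kind]}: {contents}")
    return rows


# The same neutral heap, where ("ref", 2) stands for a pointer to object 2.
heap = {
    1: ("sequence", ["hello", ("ref", 2)]),
    2: ("sequence", [1, 2, 3]),
}
python_rows = render_heap("python", heap)
javascript_rows = render_heap("javascript", heap)
```

Supporting another language in this toy model means adding one row to the label table and a backend that emits the same neutral heap description.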
This tool visualizes Java code in a similar way, but I'll skip that illustration to save space. Let's now turn to a more challenging pair of languages: C and C++. Unlike code in the above languages, C and C++ programs are not necessarily type- or memory-safe. This means that hooking into a debugger such as gdb isn't enough, since it's not clear which chunks of memory correspond to valid data. Here's a C example to show what I mean (run it live here):
This code allocates a 6-element integer array on the stack (accessible via localArray) and a 10-element integer array on the heap via malloc (accessible via b). It then populates the elements of both arrays at indices 1, 3, and 5. The resulting visualization shows those initialized values and ? symbols next to the remaining uninitialized values. In addition, it knows that the heap array has exactly 10 elements and does not try to read unallocated elements beyond that bound, since doing so risks crashing the program. Readers familiar with C and C++ will recognize that such memory allocation and initialization data is not available to debuggers such as gdb. Python Tutor hooks into Valgrind Memcheck to get this vital data. Without something like Memcheck, it would be impossible to build a safe and accurate visualizer for C and C++.
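The role of that extra data can be illustrated with a toy model, written here in Python rather than C. The ShadowArray class is hypothetical and a drastic simplification of what Memcheck tracks: it keeps an allocated length and one "initialized" flag per element, then renders ? for anything never written. The stored values are made up.

```python
class ShadowArray:
    """Toy int array with an allocation bound and per-element init flags."""

    def __init__(self, length):
        self.length = length                 # allocated bound
        self.values = [0] * length           # stand-in for raw memory
        self.initialized = [False] * length  # shadow flag per element

    def store(self, index, value):
        if not 0 <= index < self.length:
            raise IndexError("write outside the allocated block")
        self.values[index] = value
        self.initialized[index] = True

    def render(self):
        # Walk only the allocated range; show '?' for uninitialized slots.
        return [
            str(value) if is_init else "?"
            for value, is_init in zip(self.values, self.initialized)
        ]


local_array = ShadowArray(6)   # models the 6-element stack array
heap_array = ShadowArray(10)   # models the 10-element malloc'd array
for i in (1, 3, 5):
    local_array.store(i, i * 10)
    heap_array.store(i, i * 10)

local_view = local_array.render()
heap_view = heap_array.render()
```

A plain debugger sees only the raw values, so it cannot tell a stored 0 from leftover garbage or know where the heap block ends; the shadow flags and the bound are what make the ? markers and the 10-element cutoff possible.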
Finally, let's end with a C++ example (run it live here):
This visualization shows object-oriented programming concepts such as calling an instance method Date::set(), its this pointer referring to a Date object on the stack (accessible via stackDay), and another Date object on the heap (allocated with the new keyword and accessible via heapDay). Just like it does for C programs, Valgrind Memcheck ensures that Python Tutor reads only memory that has been both allocated (here recognizing that there is only one Date object on the heap) and initialized (so that it doesn't show stale junk values).
That was my quick tour of how Python Tutor works for a variety of languages that are frequently used to teach programming. The underlying principle that drives my implementation decisions is that authenticity is key: experts can work around bugs or other quirks in debugging tools, but novices will get confused and demoralized if a tool renders inaccurate diagrams. Thus, this tool needs to be able to take whatever code users put into it and do something reasonable, or at least fail in a transparent way (e.g., stopping after 1,000 execution steps and suggesting that the user shorten their code). I've tried out alternative language interpreters and experimental runtimes to get more detailed tracing (e.g., finer-grained stepping into subexpressions rather than line-level stepping), but time and time again I've gone back to built-in debugger hooks and widely deployed tools such as Valgrind since they are far more robust than experimental alternatives. Try it out today at http://pythontutor.com/ and let me know what you think!