Tuesday, February 26, 2019

How Python Tutor Uses Debugger Hooks to Help Novices Learn Programming

By: Philip Guo (@pgbovine)
Associate Editor: Karim Ali (@karimhamdanali)

Since 2010 I've been working on Python Tutor, an educational tool that helps novices overcome a fundamental barrier to learning programming: understanding what happens as the computer runs each line of code. Python Tutor allows anyone to write code in their web browser, see it visualized step by step, and get live real-time help from volunteers. Despite its now-outdated name, this tool actually supports seven languages: Python, JavaScript, TypeScript, Ruby, Java, C, and C++. So far, over 3.5 million people in over 180 countries have used it to visualize over 50 million pieces of code. You can find research papers related to this project on my publications webpage. But in this blog post, I want to dive into some implementation details that I haven't gotten to highlight in my papers.

Let's start with a simple Python example (run it live here):
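The snippet itself lives in the linked live visualization; a minimal Python example of the kind being visualized (the exact values here are hypothetical) looks like:

```python
# Three basic data structures; note that x's second element
# points to the other list, which the visualization draws as an arrow.
y = [4, 5, 6]
x = [1, y, 3]        # x[1] aliases the bottom list
d = {'a': 1, 'b': 2} # key-value mapping (dict)
s = {10, 20, 30}     # unordered set
```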

This code creates instances of three basic data structures: an ordered collection (called a list in Python), a key-value mapping (called a dict or dictionary in Python), and an unordered set. Note how elements within these data structures can point to other data structures; for instance, the second element of the top list (accessible via the global variable x) points to the bottom list. Using Python Tutor, novices can easily see pointer and aliasing relationships by following the arrows in these diagrams. Without this tool, they would need to print out serialized string values to the terminal, which obscures these critical details.

How is Python Tutor implemented? By hooking into Python's built-in debugger protocol (the bdb module in its standard library). The tool runs the user's code, single-steps through execution one line at a time, and traverses the object graph starting from global and stack-local variables. It records a full trace of the stack and heap state at every execution step and then sends the trace to the web frontend, which renders it as interactive diagrams.
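In sketch form, the approach looks like this (the hook names are the real bdb API, but the trace format below is a simplified, hypothetical stand-in; the real tool also walks every stack frame and serializes the heap):

```python
import bdb

class TraceRecorder(bdb.Bdb):
    """Sketch of the trace-everything approach: snapshot program
    state before every executed line."""
    def __init__(self):
        super().__init__()
        self.trace = []

    def user_line(self, frame):
        # bdb calls this hook each time execution reaches a new line.
        self.trace.append({
            'line': frame.f_lineno,
            'globals': {k: repr(v) for k, v in frame.f_globals.items()
                        if not k.startswith('__')},  # skip __builtins__ etc.
        })

user_code = "x = [1, 2, 3]\ny = x\nz = len(y)\n"
env = {}
rec = TraceRecorder()
rec.run(user_code, env, env)   # single-steps through user_code
for step in rec.trace:
    print(step)
```

Each entry records the state *before* its line runs, so the first snapshot is empty and later ones accumulate variables.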

The main limitation of this "trace-everything" approach is scalability: it's clearly not suitable for code which runs for millions of steps or creates millions of objects. But code written by instructors and students in educational settings is usually small -- running for dozens of steps and creating around a dozen data structures -- so this simple approach works well in practice.
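A guard against runaway code can be layered onto the same debugger hook; here is a self-contained sketch (the 1,000-step cap mirrors the cutoff mentioned later in this post; the class and field names are my own):

```python
import bdb

class BoundedTracer(bdb.Bdb):
    """Sketch: stop the traced program after a fixed number of steps,
    so a long-running loop can't blow up the trace."""
    MAX_STEPS = 1000

    def __init__(self):
        super().__init__()
        self.steps = 0
        self.truncated = False

    def user_line(self, frame):
        self.steps += 1
        if self.steps > self.MAX_STEPS:
            self.truncated = True
            self.set_quit()   # bdb raises BdbQuit, which run() catches

env = {}
t = BoundedTracer()
t.run("i = 0\nwhile True:\n    i += 1\n", env, env)  # would otherwise loop forever
print(t.truncated)   # True
```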

Now here's the same code example ported to JavaScript (run it live here):

This heap object diagram looks exactly the same as the Python one, albeit with different labels: in JavaScript, an ordered collection is called an array, and a key-value mapping is called an object (note that there's also a Map type). The JavaScript implementation works in the same way as the Python one: by hooking into the debugger protocol of the Node.js JavaScript runtime.

Here's what this example looks like in Ruby, once again implemented by hooking into the interpreter's built-in debugger protocol (run it live here):

These three identical-looking examples show how the diagrams generated by Python Tutor are designed to be fairly language-independent. Novice programmers need to learn about concepts such as stack frames, scope, control flow, primitive data types, collections, and pointers. To facilitate this learning, Python Tutor implements a graphical abstraction layer that takes the details of each language's low-level trace data and turns them into higher-level diagrams that capture the essence of the associated programming concepts. This abstraction makes it straightforward to expand the tool to work on additional languages as demand arises. It also makes it possible to scaffold learning of one language when someone already knows another one, such as teaching Python programmers how to quickly get up to speed on JavaScript.
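One way to picture such an abstraction layer is an encoder that flattens language-specific values into a generic graph of primitives, heap rows, and references; the sketch below is illustrative only (the 'LIST'/'DICT'/'SET'/'REF' tags are hypothetical, not the tool's actual trace format):

```python
def encode(value, heap):
    """Primitives stay inline; compound values move to a shared heap
    table and are replaced by references, so aliasing shows up as two
    arrows pointing at one box."""
    if isinstance(value, (bool, int, float, str, type(None))):
        return value
    obj_id = id(value)
    if obj_id not in heap:
        heap[obj_id] = None   # reserve the slot so cyclic structures terminate
        if isinstance(value, list):
            heap[obj_id] = ['LIST'] + [encode(v, heap) for v in value]
        elif isinstance(value, dict):
            heap[obj_id] = ['DICT'] + [[encode(k, heap), encode(v, heap)]
                                       for k, v in value.items()]
        elif isinstance(value, (set, frozenset)):
            heap[obj_id] = ['SET'] + [encode(v, heap) for v in value]
        else:
            heap[obj_id] = ['INSTANCE', type(value).__name__]
    return ['REF', obj_id]

heap = {}
y = [4, 5, 6]
x = [1, y, y]          # aliasing: two elements reference the same list
root = encode(x, heap)
print(root, heap[root[1]])
```

A renderer for any frontend language then only has to draw this generic form, which is what makes adding new languages relatively cheap.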

This tool visualizes Java code in a similar way, but I'll skip that illustration to save space. Let's now turn to a more challenging pair of languages: C and C++. Unlike code in the above languages, C and C++ programs are not necessarily type- or memory-safe. This means that hooking into a debugger such as gdb isn't enough, since it's not clear which chunks of memory correspond to valid data. Here's a C example to show what I mean (run it live here):

This code allocates a 6-element integer array on the stack (accessible via localArray) and a 10-element integer array allocated on the heap via malloc (accessible via b). It then populates the elements of both arrays at indices 1, 3, and 5. The resulting visualization shows those initialized values and ? symbols next to the remaining uninitialized values. In addition, it knows that the heap array has exactly 10 elements and does not try to read unallocated elements beyond that bound, which risks crashing the program. Readers familiar with C and C++ will recognize that such memory allocation and initialization data is not available to debuggers such as gdb. Python Tutor hooks into Valgrind Memcheck to get this vital data. Without something like Memcheck, it would be impossible to build a safe and accurate visualizer for C and C++.

Finally, let's end with a C++ example (run it live here):

This visualization shows object-oriented programming concepts such as calling an instance method Date::set(), its this pointer referring to a Date object on the stack (accessible via stackDay), and another Date object on the heap (allocated with the new keyword and accessible via heapDay). Just like it does for C programs, Valgrind Memcheck ensures that Python Tutor reads only memory that has been both allocated (here recognizing that there is only one Date object on the heap) and initialized (so that it doesn't show stale junk values).

That was my quick tour of how Python Tutor works for a variety of languages that are frequently used to teach programming. The underlying principle that drives my implementation decisions is that authenticity is key: experts can work around bugs or other quirks in debugging tools, but novices will get confused and demoralized if a tool renders inaccurate diagrams. Thus, this tool needs to take whatever code users put into it and do something reasonable, or at least fail in a transparent way (e.g., stopping after 1,000 execution steps and suggesting that the user shorten their code). I've tried out alternative language interpreters and experimental runtimes to get more detailed tracing (e.g., finer-grained stepping into subexpressions rather than line-level stepping), but time and time again I've gone back to built-in debugger hooks and widely-deployed tools such as Valgrind, since they are far more robust than experimental alternatives. Try it out today at http://pythontutor.com/ and let me know what you think!

Tuesday, February 19, 2019

Software Improvement by Data Improvement

By: William B. Langdon, CREST Department of Computer Science, University College, London

Associate Editor: Federica Sarro, University College London (@f_sarro)

In previous blogs [1, 2] we summarised genetic improvement [3] and described the result of applying it to BarraCUDA [4]. Rather than giving a comprehensive update (see [5]), I describe a new twist: evolving the constants buried within software to give better results. The next section describes updating 50,000 free energy parameters used by dynamic programming to find the lowest energy state of RNA molecules, and hence predict their secondary structure (i.e. how they fold), by fitting the parameters in silico to known true structures. The last section describes converting a GNU C library square root function into a cube root function by data changes alone.

Better RNA structure prediction via data changes only

RNAfold is approximately 7,000 lines of code within the open source ViennaRNA package. Almost all the constants within the C source code are provided via 21 multi-dimensional (1–6 dimensions) int arrays [6, Tab. 2]. We used a population of 2000 variable-length lists of operators to mutate these integers. The problem-dependent operators can invert values, replace them, or update them with nearby values. They can be applied to individual values or, using wild cards (*), to sub-slices or even whole arrays. From these, a population of mutated RNAfold variants is created. Each member of the population is tested on 681 small RNA molecules, and each mutant's predictions are compared with the known structures [6, Tab. 1]. At the end of each generation, the members of the population are sorted by their average fitness on the 681 training examples, and the top 1000 are selected to be parents of the next generation. Half the children are created by mutating one parent and the other 1000 by randomly combining two parents. After one hundred generations, the best mutant in the last generation is tidied (i.e. ineffective bloated parts of it are discarded) and used to give a new set of 50,000 integer parameters (29% of them are changed).
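Scaled down, the evolutionary loop looks roughly like this toy sketch (the fitness function is a stand-in for comparing mutant RNAfold predictions against known structures, and the population sizes are reduced from 2000/1000 to 200/100):

```python
import random

random.seed(0)                                        # reproducible toy run
TRUE = [random.randint(-50, 50) for _ in range(20)]   # hidden "ideal" parameters

def fitness(params):
    # Stand-in objective: closeness to the hidden ideal vector (higher is better).
    return -sum(abs(p - t) for p, t in zip(params, TRUE))

def mutate(parent):
    child = parent[:]
    i = random.randrange(len(child))
    child[i] += random.choice([-2, -1, 1, 2])   # update with a nearby value
    return child

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]

pop = [[0] * len(TRUE) for _ in range(200)]     # start from the original parameters
for generation in range(100):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:100]                         # keep the fittest half as parents
    pop = ([mutate(random.choice(parents)) for _ in range(100)] +
           [crossover(random.choice(parents), random.choice(parents))
            for _ in range(100)])

best = max(pop, key=fitness)                    # best mutant of the last generation
print(fitness([0] * len(TRUE)), '->', fitness(best))
```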
On average, on both big and small molecules of known structure (not used in training), the new version of RNAfold does better than the original. (In many cases it gives the same prediction, in some it is worse but in more it is better.)
Figure 1 shows RNAfold’s original prediction of the secondary structure of an example RNA molecule and then the new prediction using the updated free energy parameters.

Figure 1: Secondary structure (i.e. folding patterns) for RNA molecule PDB 01001. 1) The prediction made by the original RNAfold does not match the true structure well; for example, the highlighted hairpin loop (red) is not in the true structure. 2) The prediction made with the GI parameters is almost identical to the true structure. 3) The true structure. (Figure 2 tries to show the three-dimensional structure of two PDB 01001 RNA molecules in a larger complex.)

A new Cube Root Function

The GNU C library contains more than a million constants. Most of these are related to internationalisation and non-ASCII character sets [8]. However, one implementation of the double-precision square root function uses a table of 512 pairs of real numbers. (Most implementations of sqrt(x) simply call low-level machine-specific routines.) The table-driven implementation is written in C and essentially uses three iterations of the Newton-Raphson method. To guarantee convergence on the correct sqrt(x) to double-precision accuracy, Newton-Raphson is given a very good start point for both the target value x^(1/2) and the derivative 0.5x^(−1/2), and these are held as pairs in the table.
Unlike the much larger RNAfold (previous section), with cbrt(x) some code changes were made by hand. These were to deal with: x being negative, normalising x to lie in the range 1.0 to 2, reversing the normalisation so that the answer has the right exponent, and replacing the Newton-Raphson constant 1/2 by 1/3 [8, Sec. 2.1]. Given a suitable objective function (how close cbrt(x)×cbrt(x)×cbrt(x) is to x), and starting with each of the pairs of real numbers for sqrt(x), CMA-ES [9] could evolve all 512 pairs of values for the cube root function in less than five minutes.
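In outline, the table-plus-Newton scheme looks like this (an illustrative Python model, not the glibc code: the real routine also tabulates the derivative term, and the evolved version stores 512 fitted pairs rather than the formula used to build this toy seed table):

```python
# A coarse 512-entry seed table over the normalised range [1.0, 2.0):
# each entry is a rough cube-root estimate for the start of its cell.
SEEDS = [(1 + i / 512) ** (1 / 3) for i in range(512)]

def cbrt_normalised(x):
    """Cube root for 1.0 <= x < 2.0 (normalisation handled elsewhere):
    table lookup for a good start point, then three Newton iterations."""
    y = SEEDS[int((x - 1) * 512)]       # very good start point from the table
    for _ in range(3):
        y = (2 * y + x / (y * y)) / 3   # Newton step for f(y) = y**3 - x
    return y

print(abs(cbrt_normalised(1.728) - 1.2))   # error on the order of machine epsilon
```

Because the table seed is already within a fraction of a percent of the answer, three quadratically converging Newton steps reach full double precision.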
The GNU C library contains many math functions which follow similar implementations. For fun, we used the same template to generate the log2(x) function [10].

Figure 2: Three dimensional structure of two PDB 01001 RNA molecules (blue, orange) in a Yeast protein complex (green, yellow) [7, Fig 2. A].


  1. W. B. Langdon and Justyna Petke. Genetic improvement. IEEE Software Blog, February 3 2016.
  2. W. B. Langdon and Bob Davidson. BarraCUDA in the cloud. IEEE Software Blog, 8 January 2017.
  3. W. B. Langdon and Mark Harman. Optimising existing software with genetic programming. IEEE Transactions on Evolutionary Computation, 19(1):118–135, 2015.
  4. W. B. Langdon and Brian Yee Hong Lam. Genetically improved BarraCUDA. BioData Mining, 20(28), 2 August 2017.
  5. Justyna Petke et al. Genetic improvement of software: a comprehensive survey. IEEE Transactions on Evolutionary Computation, 22(3):415–432, 2018.
  6. W. B. Langdon, Justyna Petke, and Ronny Lorenz. Evolving better RNAfold structure prediction. In Mauro Castelli et al., editors, EuroGP 2018, pages 220–236, Parma, Italy, 2018. Springer Verlag.
  7. Masaru Tsunoda et al. Structural basis for recognition of cognate tRNA by tyrosyl-tRNA synthetase from three kingdoms. Nucleic Acids Research, 35(13):4289–4300, 2007.
  8. W. B. Langdon and Justyna Petke. Evolving better software parameters. In Thelma Elita Colanzi and Phil McMinn, editors, SSBSE 2018, pages 363–369, Montpellier, France, 2018. Springer.
  9. Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
  10. W. B. Langdon. Evolving square root into binary logarithm. Technical Report RN/18/05, University College, London, UK, 2018.

You might also enjoy reading

  • James Temperton. Code ’transplant’ could revolutionise programming. Wired.co.uk, 30 July 2015. Online.
  • John R. Woodward, Justyna Petke, and William Langdon. How computers are learning to make human software work more efficiently. The Conversation, 10.08am BST, June 25 2015.
  • Justyna Petke. Revolutionising the process of software development. DAASE project blog, August 7 2015.

Monday, February 11, 2019

Learning from Failure: Modeling User Goals in the App Market

By: Grant Williams (@g_will74), Nishant Jha (@nishant_vodoo), and Anas Mahmoud (@nash_mahmoud), Louisiana State University

Associate Editor: Federica Sarro, University College London ( @f_sarro)

App success has been broadly studied in the literature. This line of research is mainly driven by the significant business interest in the performance of mobile apps. In general, app success can be quantified based on market performance; apps that get more downloads, appear frequently on the top charts, have steady user acquisition and retention rates, or generate more revenue are more successful. Failure, on the other hand, is a less well-understood phenomenon. Naturally, success is more appealing as an object of analysis than failure. However, market research has shown that failure should be studied in its own right rather than folded into an organization's total prior experience. According to Madsen and Desai [1], organizations learn more effectively from failures than from successes. Analyzing individual cases of failure represents a unique opportunity to learn what went wrong, what the key lessons are, and how to prevent similar failures in the future.

To shed more light on app failure, in this article we describe a case study dedicated to analyzing the main reasons that led to the failure of Yik Yak, one of the most popular social networking apps of 2015. Our objective is to model the main user goals in Yik Yak's domain along with their relations to the core features of that domain. Models help translate tacit domain knowledge into an explicit form. Once explicit domain knowledge is created, it can be preserved, communicated, and passed on to others.

A Case Study on App Failure

Fig. 1: The number of monthly downloads after anonymity was removed and then restored.

Launched in 2013, Yik Yak was a location-based social networking app that allowed users to post and vote on short messages (known as yaks). Yik Yak distinguished itself from its competitors by two main features: anonymous communication and geographical locality; users could anonymously post and engage in discussions within a 1.5 mile radius around their location. The combination of anonymity and locality proved successful, and by the end of 2014, Yik Yak had become one of the most popular social networking platforms on college campuses, with a market valuation close to $400 million. 
Despite its popularity among users, the anonymity provided by Yik Yak had a downside. Due to the lack of personal identifiability, Yik Yak posts were an easy vector for cyberbullies to anonymously harass other users at their schools or universities. The problem became severe when the app was used to make threats credible enough for institutions to request police assistance. To curb cyberbullying, Yik Yak rolled out mandatory handles, forcing users to choose a screen name. The community's response to this feature update was overwhelmingly negative, and the number of downloads and active users dropped sharply. Yik Yak's attempt to reverse course by reintroducing anonymous posting never recaptured the app's previous popularity, eventually forcing the company to shut down the app and suspend operations in May 2017.

The story of Yik Yak represents a unique opportunity to analyze one of the most recent major failures in the app market. To get a systematic, in-depth look into this case, we use feature-softgoal interdependency graphs (F-SIGs), which enable comprehensive qualitative reasoning about the complex interplay between the functional features of an application domain and its users' goals. To generate our model, we first identified Yik Yak's main rivals in the domain of anonymous social networking apps: Kik, Jodel, Firechat, Whisper, Swiflie, and Spout. To identify end-user goals in the domain, we analyzed user feedback available in app store reviews and the Twitter feeds of these apps [3]. Our analysis focused on identifying user sentiments toward the core features of the domain [4]. The generated F-SIG model is shown in Figure 2.
Fig. 2: A model of the common user goals (concerns) and their relationships to the core features of the domain of anonymous social networking apps.

Significance and Future Work

The F-SIG diagram in Figure 2 shows that several in-depth insights can be gleaned from analyzing user goals and their relations to the core features of the domain. Such information can serve as input for the Next Release Problem (NRP), which is concerned with maximizing customer value by optimizing the subset of requirements to be included in the next release. Furthermore, through the model's dependency relations, developers can get a sense of the synergies and trade-offs between features and user goals, and thus adjust their release strategies to focus on features that enhance the most desirable goals. This information can be particularly useful for smaller businesses and startups trying to break into the app market. Specifically, the proposed model can serve as a core asset that helps startup companies gain a quick and comprehensive understanding of the history of the domain, including how specific user goals and features emerged and evolved to their current state.
Our future work in this direction will be focused on automating the model generation process, utilizing automatic app store and social media mining methods as well as automated domain modeling techniques. Our goal is to devise automated methods for data collection, domain feature analysis, and model generation. These methods will be evaluated over large datasets of app data to ensure their practicality and accuracy.


[1] P. Madsen and V. Desai, “Failing to learn? The effects of failure and success on organizational learning in the global orbital launch vehicle industry,” Academy of Management Journal, vol. 53, no. 3, pp. 451–476, 2010.

[2] S. Jarzabek, B. Yang, and S. Yoeun, “Addressing quality attributes in domain analysis for product lines,” IEE Proceedings - Software, vol. 153, no. 2, pp. 6-73, 2006.

[3] W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, “A survey of app store analysis for software engineering,” IEEE Transactions on Software Engineering, vol. 43, no. 9, pp. 817-847, Sept. 2017.

[4] G. Williams and A. Mahmoud, “Modeling User Concerns in the App Store: A Case Study on the Rise and Fall of Yik Yak,” in IEEE International Requirements Engineering Conference, 2018.