By Rishabh Singh (@rishabhs), Associate Editor.
In this world of ever-increasing automation from
self-driving cars to personal robots, how far are we in automating the art of
writing software? Surprisingly, the problem of automatically learning programs,
also known as “program synthesis”, has been an intriguing research area dating
back to 1970s [1]. A lot of work then focused on using deductive reasoning in
automated theorem provers to derive programs by systematically transforming the
program specifications. These techniques were able to automatically synthesize
simple programs involving numerical computations and data structure operations,
but didn’t quite catch on since often times writing a complete specification of
the program proved to be a more daunting task than writing the original program
itself in first place. As a result, the interest in program synthesis research
started to die down in late 1980s, but the research area has seen a recent
resurgence and is now considered one of the hottest areas in software engineering
and programming languages community.
What changed from
1980s?
In my opinion, the three biggest advances that have led to
this resurgence are: 1) exceptional advances in constraint solving algorithms
(such as SAT and SMT solvers), 2) slightly different formulation of the problem
(more on it next), and 3) Moore’s law giving us faster and faster compute
power. While the traditional techniques focused on learning arbitrary programs
from complete specifications, the recent techniques formulate the problem in two
key different ways. First, instead of relying on complete specifications which
are hard to write, they allow for incomplete and more natural specification
mechanisms such as input-output examples, demonstrations of program behavior,
partial programs with holes, natural language, etc. and their combinations.
Second, instead of considering Turing-complete languages for the hypothesis
space of the possible set of programs, they restrict the hypothesis space using
a Domain-specific language. This results in not only faster search algorithms
since the hypothesis space is more structured, but also results in learning
programs that are more understandable and readable.
Isn’t this an
undecidable problem?
Yes, in general it is an undecidable problem. Even answering
if a program always terminates is an undecidable problem. Here, synthesis
techniques are solving a much harder problem of learning a program that not
only terminates but also satisfies the properties defined by the specification.
For some fragments of language hierarchy such as restricted regular
expressions, there are decidable and efficient synthesis algorithms. Even for
more expressive languages, techniques based on bounded reasoning and abstract
interpretation have proven to be quite successful. Instead of providing
guarantees over complete input space (an undecidable problem), these techniques
only consider a finite bounded set of inputs with the assumption that if the
learnt program behaves correctly on the bounded set of inputs it is very likely
to also work for all possible inputs.
What can we
synthesize currently?
Even with all the recent advances, the ideal applications
for synthesis techniques have been in learning programs that are small (few
tens of lines) but are complex and tricky to write manually. Some of the recent
applications include synthesizing programs for data extraction and
transformation using examples, parsers from interactive labeling of
sub-expressions in a program, program refactoring sequences given an input and
output program, network policies from scenarios consisting of packet traces and
corresponding actions to the packets, optimal platform-tuned implementations of
signal processing transforms from their mathematical descriptions, compiler for
a low-power spatial architectures that partitions the program into optimal
fragments, efficient synchronization in concurrent programs, low-level
bitvector algorithms, type completions, hints and feedback for programming
assignments, and many many other cool applications [2]. Let me expand a bit
more on how synthesis techniques are being applied concretely in two
interesting domains of Data Wrangling and Computer-aided Education.
Data Wrangling &
Computer-aided Education
Data wrangling is a term coined to refer to the process of
converting data in a raw format to a suitable format that allows for subsequent
useful analysis. Since writing such data extraction/transformation scripts
manually is cumbersome and sometimes even beyond the programming expertise of
the users, some studies estimate that this process of data wrangling takes 80%
of the total analysis time. The FlashFill
system [3,4], often quoted as one of the top new features in Microsoft Excel 2013,
allows users to perform regular-expression based data transformations such as
splitting and merging by only providing a few input-output examples without the
need to write complex Excel macros or VBA. These techniques have been extended
to learn semantic data type transformations (such as date, address, etc.) and
table join/lookup programs by examples. The FlashExtract
system [5] allows users to perform data extraction from semi-structured text
files using few examples. The key idea behind these systems is to first design
a domain-specific language that is expressive enough to encode majority of the
data wrangling tasks but at the same time is concise enough for amenable
learning. These systems then use Version-space algebra based techniques to
efficiently search over the large space of programs in the DSL to learn the
programs that are consistent with the user-provided examples.
Computer-aided Education is another domain where synthesis
techniques are finding an interesting application. There has been a lot of
interest recently in making quality education accessible to students worldwide
using education initiatives such as EdX, Coursera, and Udacity. The massive open
online courses (MOOC) on these platforms are typically taken by hundreds of
thousands of students, which presents a massive challenge to provide quality
feedback to these students in the same way teachers do in traditional
classrooms. Since there are several different ways to solve a given problem
especially in programming courses, we cannot simply use a syntactic approach to
provide feedback. The AutoProf
system [6] uses synthesis techniques to automatically generate feedback on
introductory programming assignments. Given an incorrect student submission, a
reference solution, and an error model capturing the set of common mistakes
students typically make for a given problem, AutoProf uses the Sketch synthesis
system to compute minimal changes to the student program such that it becomes
functionally equivalent to the reference solution. It has been used by TAs to
grade homework assignments, and is also being integrated on the MITx platform.
A similar auto-grading system CPSGrader [7]
was developed for grading laboratory problems in the area of cyber-physical
system using parameter synthesis techniques, and was successfully used for
grading thousands of submissions for the online edX class. AutomataTutor [8] is another such
system that uses synthesis techniques for automated grading and feedback
generation for finite automata constructions, and is currently being used by
thousands of students worldwide.
Isn’t Program
Synthesis just Machine Learning?
There has been a long ongoing debate on how machine learning
is similar/different from program synthesis, since program synthesis techniques
(machines) are essentially learning a program from some specification (training
data). However, there are some key differences. First, machine learning
techniques typically learn a complex higher-order function over a set of finite
feature values, whereas program synthesis techniques learn complex structured
programs in a Domain-specific language (sometimes recursive and even Turing-complete).
This also results in human-understandability of the learnt programs, which can
be manually inspected in contrast to complex higher-dimensional functions
learnt by machine learning techniques. Second, the synthesis techniques learn
from very few examples (often 1), whereas machine learning techniques requires
a large amount of data. Finally, synthesis techniques aren’t very robust to
noise in the datasets whereas machine learning techniques are able to handle
noise quite effectively. A more comprehensive comparison can be found in this
nice survey article [9].
What are the current
challenges and trends in Program Synthesis research?
The scalability of synthesis algorithm to learn larger and
larger pieces of code has been a perpetual challenge. A recent trend has been
to devise new ways to combine machine learning techniques with program
synthesis techniques for both scaling the inference process and also making it
more robust to handle noise in specification. Another important challenge has
been in developing the right user interaction model, where users can provide
specification iteratively in an interactive fashion instead of providing the
specification in a one-shot black-box manner.
So, are our
programming jobs going to be taken up by robots?
Well, not really, at least not in foreseeable future. With
the current synthesis algorithms and computational resources, I believe that
soon enough we will be able to leverage synthesis technology to automatically
write small complex functions using high-level specifications in our day-to-day
programming tasks, but the creative burden of coming up with the right design
and algorithms for composing these functions to build larger software artifacts
would still lie with the programmer. Though, one day we might reach the stage
where even the complex algorithms and designs can be automatically generated by
the synthesis algorithms. From one of the famous quotes of President John F.
Kennedy (slightly rephrased): “If we have
the talent to invent new machines that puts people out of work, we have the
talent to put them back to work.” Then, we will reinvent programming.
- Zohar Manna, Richard J. Waldinger: Synthesis: Dreams - Programs. IEEE Trans. Software Eng. 5(4): 294-328 (1979)
- Rastislav Bodík, Barbara Jobstmann. Algorithmic program synthesis: introduction. STTT 15(5-6): 397-411 (2013)
- Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. POPL 2011: 317-330
- Sumit Gulwani, William R. Harris, Rishabh Singh: Spreadsheet data manipulation using examples. Communications of the ACM 55(8): 97-105 (2012)
- Vu Le, Sumit Gulwani: FlashExtract: a framework for data extraction by examples. PLDI 2014: 542-553
- Rishabh Singh, Sumit Gulwani, Armando Solar-Lezama: Automated feedback generation for introductory programming assignments. PLDI 2013: 15-26
- Garvit Juniwal, Alexandre Donzé, Jeff C. Jensen, Sanjit A. Seshia: CPSGrader: Synthesizing temporal logic testers for auto-grading an embedded systems laboratory. EMSOFT 2014: 24:1-24:10
- Rajeev Alur, Loris D'Antoni, Sumit Gulwani, Dileep Kini, Mahesh Viswanathan: Automated Grading of DFA Constructions. IJCAI 2013
- Sumit Gulwani, José Hernández-Orallo, Emanuel Kitzelmann, Stephen H. Muggleton, Ute Schmid, Benjamin G. Zorn: Inductive programming meets the real world. Communications of the ACM 58(11): 90-99 (2015)
If you liked this you may also like:
Julig, R.K., "Applying formal software synthesis", in Software, IEEE , vol.10, no.3, pp.11-22, 1993.
Kant, E., "Synthesis of mathematical-modeling software", in Software, IEEE , vol.10, no.3, pp.30-41, 1993.
Mueller, R.A.; Duda, M.R., "Formal Methods of Microcode Verification and Synthesis", in Software, IEEE , vol.3, no.4, pp.38-48, 1986.
Abbott, B.; Bapty, T.; Biegl, C.; Karsai, G.; Sztipanovits, J., "Model-based software synthesis", in Software, IEEE , vol.10, no.3, pp.42-52, 1993.
Thanks for your valuable post.
ReplyDeleteobiee training in chennai
Interesting and useful article.
ReplyDeletesoftware training in chennai