
Program code generation using neural networks

Automatic test script generation (in Russian)

15.05.2018





What is neural program synthesis?

12.04.2018

In recent years, Deep Learning has made considerable progress in areas such as online advertising, speech recognition, and image recognition. The success of DL lets us change our view of how software itself is created. We can use neural nets to gradually increase automation in the process of program creation and help engineers get more done with less effort.

There are many potential applications for program synthesis. Successful systems could one day automate a job that is currently very secure for humans: computer programming. Imagine a world in which debugging, refactoring, translating and synthesizing code from sketches can all be done without human effort.

What is the Thousand Monkeys Typewriter?

TMT is a system for program induction that generates simple scripts in a domain-specific language (DSL). The system combines supervised and unsupervised learning. The core is the Neural Programmer-Interpreter (NPI), which is capable of abstraction and higher-order control over programs. The system works for error detection in both user logs and software source code.

TMT also incorporates the most common concepts used today in the field of program synthesis: satisfiability modulo theories (SMT) and counterexample-guided inductive synthesis (CEGIS).
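For intuition, here is a minimal CEGIS loop over a toy DSL of linear functions. This is an illustrative sketch, not TMT's actual implementation; the spec, DSL, and function names are all assumptions made for the example:

# Minimal CEGIS sketch (illustrative only, not TMT's code).
# Goal: synthesize f(x) matching a hidden spec on a bounded input range.

def spec(x):                       # the "specification" we try to match
    return 2 * x + 3

# Toy DSL: a program is a pair (a, b) meaning f(x) = a * x + b
CANDIDATES = [(a, b) for a in range(-5, 6) for b in range(-5, 6)]

def synthesize(examples):
    # Return the first candidate consistent with all examples seen so far.
    for a, b in CANDIDATES:
        if all(a * x + b == y for x, y in examples):
            return a, b
    return None

def verify(a, b):
    # Search a bounded input space for a counterexample.
    for x in range(-100, 101):
        if a * x + b != spec(x):
            return x               # counterexample found
    return None                    # candidate correct on the whole range

def cegis():
    examples = [(0, spec(0))]      # start with a single example
    while True:
        cand = synthesize(examples)
        if cand is None:
            return None            # the DSL cannot express the spec
        cx = verify(*cand)
        if cx is None:
            return cand            # verified program
        examples.append((cx, spec(cx)))  # refine with the counterexample

print(cegis())                     # -> (2, 3)

A real system would replace the brute-force synthesizer and verifier with an SMT solver, but the propose/refute loop is the same.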

Types of data

There are two types of data (logs) that we are analyzing:

Supervised and unsupervised

To analyze logs, we use both an unsupervised technique (Donut for user logs) and a supervised one (engineers mark anomalies in software traces using JUnit tests).

NPI

NPI is the core of the system. It takes logs and traces and, at each timestep, learns the probability of the next action given the current environment.

The Neural Programmer-Interpreter (NPI) consists of three parts (a code sketch follows the list):

  1. RNN controller that takes sequential state encodings built from (a) the world environment (which changes with actions), (b) the program call (the action) and (c) the arguments for the called program. The full input is fed at the first timestep; from then on, every action the NPI takes produces an output that is fed back as the next input.
  2. DSL functions
  3. Domain itself where functions are executed (“scratchpad”)
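Below is a minimal sketch of one controller timestep, assuming a PyTorch implementation with hypothetical layer sizes; the published NPI (Reed and de Freitas) additionally keeps a key-value program memory, which is omitted here:

import torch
import torch.nn as nn

class NPICore(nn.Module):
    # Illustrative NPI controller step (assumed sizes, not TMT's code).
    def __init__(self, env_dim=32, arg_dim=8, n_programs=16, hidden=64):
        super().__init__()
        self.prog_embed = nn.Embedding(n_programs, hidden)     # (b) program call
        self.state_enc = nn.Linear(env_dim + arg_dim, hidden)  # (a) + (c) fused
        self.lstm = nn.LSTMCell(2 * hidden, hidden)            # RNN controller
        self.next_prog = nn.Linear(hidden, n_programs)  # which program to call
        self.next_args = nn.Linear(hidden, arg_dim)     # its arguments
        self.terminate = nn.Linear(hidden, 1)           # stop probability

    def forward(self, env, args, prog_id, hc):
        s = torch.relu(self.state_enc(torch.cat([env, args], dim=-1)))
        p = self.prog_embed(prog_id)
        h, c = self.lstm(torch.cat([s, p], dim=-1), hc)
        return (self.next_prog(h), self.next_args(h),
                torch.sigmoid(self.terminate(h)), (h, c))

# One timestep: the predicted (program, args) are fed back at the next step,
# and the environment changes as the called program executes.
core = NPICore()
env = torch.zeros(1, 32); args = torch.zeros(1, 8)
prog = torch.tensor([0])                    # e.g. the BEGIN program id
h = c = torch.zeros(1, 64)
prog_logits, next_args, stop_p, (h, c) = core(env, args, prog, (h, c))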

NPI illustration (figure)

Currently, TMT generates simple scripts for anomaly detection in production logs.

How the generator works

Data

At the moment, we analyze three types of logs: user logs, database logs, and software traces.

Detect anomalies

Then we try to detect any problems the logs contain. What exactly is an anomaly? Simply put, an anomaly is any deviation from standard behavior.

Normal data representation (figure)

Point anomalies, i.e. anomalies in a single value in the data (figure)

Query execution time anomalies (figure)

We aim to detect anomalies in situations such as memory leaks, bottlenecks in the Java runtime, server infrastructure problems, etc.

As a result, we acquire training data, either labeled manually (supervised) or labeled by an automatic classifier (unsupervised).
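As an illustration of the unsupervised path, an automatic classifier could label query execution times by their deviation from typical behavior. The sketch below uses a plain z-score as a stand-in for a real detector such as Donut; the function name and threshold are assumptions:

import statistics

def label_events(exec_times_ms, threshold=2.5):
    # Label each measurement 'normal'/'abnormal' by z-score.
    # Illustrative stand-in for a real detector such as Donut.
    mean = statistics.mean(exec_times_ms)
    std = statistics.stdev(exec_times_ms) or 1.0   # avoid division by zero
    return [('abnormal' if abs(t - mean) / std > threshold else 'normal', t)
            for t in exec_times_ms]

# Example: a spike in query execution time is labeled as a point anomaly.
times = [12, 14, 11, 13, 12, 15, 13, 210, 12, 14]
print(label_events(times))

The labeled pairs then serve directly as training data for the core.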

Train the Neural Programmer-Interpreter

After we get a list of labeled normal and abnormal events, we train our core to distinguish what is normal from what is not in the future.

In the case of unsupervised learning, the process can be described as “one neural net teaching another”:

An event in a log labeled as normal (figure)

An event in a log labeled as abnormal (figure)

A DB query labeled as normal (figure)

A DB query labeled as abnormal (figure)

In some cases, where situations are labeled as normal by default, we only have to decide which command to call next (figure).

Working with the scripts at runtime

Having a trained NPI means that, at each step, we have an operation predicted from the arguments and the environment. Thus we expect a well-trained model to predict each command at each step, indicating whether the observed situation in the logs (software traces) is normal or not. If it is normal, we expect one outcome; if not, another.

In other words, the model predicts an outcome from a given state: label (by default, “normal”), argument and environment. Each combination of these parameters can produce a different outcome.

A sample normal runtime script with its environment:

BEGIN 
DIFF 
DIFF
CHECK
NO_ALARM

An alert runtime script:

BEGIN 
DIFF 
DIFF
CHECK
ALARM

Data environment

DIFF ({'program': {'program': 'diff', 'id': 6}, 'environment': {'date1': 15, 'output': 0, 'answer': 2, 'terminate': False, 'client_id': 2, 'date2': 20, 'date2_diff': 45, 'date1_diff': 93}, 'args': {'id': 29}})
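A sketch of this runtime loop follows, assuming the NPICore sketch from earlier and a hypothetical execute function that applies a predicted command to the environment; decoding is greedy, and the command names mirror the sample scripts above:

import torch

PROGRAMS = ['BEGIN', 'DIFF', 'CHECK', 'NO_ALARM', 'ALARM']  # hypothetical ids

def execute(cmd, env):
    # Hypothetical executor stub: a real one would apply cmd to the logs
    # and return the updated environment encoding.
    return env

def run_script(core, env, args, max_steps=10):
    # Greedy roll-out: predict a command, execute it, feed the result back.
    h = c = torch.zeros(1, 64)
    prog = torch.tensor([0])                   # start from BEGIN
    script = ['BEGIN']
    for _ in range(max_steps):
        logits, args, stop_p, (h, c) = core(env, args, prog, (h, c))
        prog = logits.argmax(dim=-1)           # most probable next command
        script.append(PROGRAMS[prog.item()])
        env = execute(PROGRAMS[prog.item()], env)
        if stop_p.item() > 0.5:                # model signals termination
            break
    return script

core = NPICore(n_programs=len(PROGRAMS))       # untrained weights, demo only
print(run_script(core, torch.zeros(1, 32), torch.zeros(1, 8)))

After training, a roll-out like this should reproduce a script such as BEGIN DIFF DIFF CHECK NO_ALARM for a normal environment and end in ALARM for an abnormal one.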

Challenges

One of the problems with NPIs is that we can only measure generalization by running the trained NPI in various environments and observing the results. And as we explained earlier, every change of the parameters can produce a new script.

For the sake of simplicity, we want to create one script that will cover many (ideally all) situations (figure).

This means we still have to build a module that merges all the scripts a particular NPI can produce into the smallest possible number of scripts, preferably a single one; a sketch of one possible approach follows.
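One possible shape for such a module, sketched under the assumption that generated scripts are plain command sequences: merge the scripts produced across environments into a prefix tree, so shared prefixes are emitted once and divergences become branch points. This is an illustration, not TMT's actual merger:

# Illustrative merger sketch (assumed design, not TMT's actual code):
# fold many generated scripts into one branching script via a prefix tree.

def merge_scripts(scripts):
    trie = {}
    for script in scripts:
        node = trie
        for cmd in script:
            node = node.setdefault(cmd, {})   # shared prefixes collapse here
    return trie

def render(node, depth=0):
    # Print the merged script; deeper indentation marks a branch.
    lines = []
    for cmd, child in node.items():
        lines.append('  ' * depth + cmd)
        lines.extend(render(child, depth + 1))
    return lines

scripts = [
    ['BEGIN', 'DIFF', 'DIFF', 'CHECK', 'NO_ALARM'],   # normal run
    ['BEGIN', 'DIFF', 'DIFF', 'CHECK', 'ALARM'],      # alert run
]
# The shared prefix BEGIN DIFF DIFF CHECK is emitted once; the two runs
# diverge only at the final command, which becomes a branch.
print('\n'.join(render(merge_scripts(scripts))))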

Examples

References

Deep Learning: A Critical Appraisal

Andrej Karpathy: Software 2.0

Neuro-Symbolic Program Synthesis

Improving the Universality and Learnability of Neural Programmer-Interpreters with Combinator Abstraction

A curated list of awesome machine learning frameworks and algorithms that work on top of source code

Neural programmer concepts

RobustFill (Microsoft)

DeepCoder (Microsoft)

Program Synthesis with Reinforcement Learning (Google)

Bayou (https://github.com/capergroup/bayou)

Tree-to-tree parser

Kayak (DiffBlue)

Anomaly detection

Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications

Anomaly Detection for Industrial Big Data

Faster Anomaly Detection via Matrix Sketching

Our monkey (figure)