perlgen

Genetic algorithm for generating valid perl

Intro

After reading this node, I began working on machine-generating, randomly or otherwise, valid perl. This is a rather pointless excersise, the main goal of which is to generate ``interesting'' results.

Brute force

My first approach was naive and slow. It used a brute force search, generating and validating every string with a given set of characters. This is obviously slow because of the large keyspace (invoking the perl interpreter for every string is a major hit as well). Also, because of the sequential search, it usually takes a long time for the results to become interesting. Consider the starting string ``aaaaa''. The search would have to run through the entire four-character keyspace before the first letter changed.

Genetic Algorithm

After some meditation, I began working on a genetic algorithm[1] approach. Obviously the two approaches differ in results[2]: the brute force method generates /all/ valid perl code, where the GA generates the ``fittest.''

The fitness function of this GA cannot be based solely on the validity of the code, as booleans don't make for good fitness functions <g>. Code is either valid or it isn't. With a simple right/wrong answer, there is no way for the GA to converge on a solution. Conveniently, however, valid perl is quite easy to generate randomly. There is no need to intelligently prune through the population until a valid string is reached: odds are that we've started with a few already. Thus, all the fitness function needs to determine is how ``interesting'' the code is. The main difficulty, then, is objectively describing ``interesting.''

Fitness function #1: ugly series of regexes

As the heading implies, the first fitness function was.. suboptimal. It started by checking to see if the individual was valid perl (by compiling it, as in the brute force example). If it failed to compile, it would receive a score of zero. If compilation succeeded, the fitness function would continue, assigning an extremely low score if the given individual was POD, or a quoted string (considered ``cheating''). It then gave points to code with space separated words, and removed a few points for other lesser forms of ``cheating,'' such as having spaces at the start or end of the individual. At this point, my (already limited) creativity ran dry, and I could think of no more tests to add to the fitness function. It also became clear that continuing to expand the current function would end up an aborted, half-baked attempt to parse perl[3]. A snippet of the function follows:

  ## quotes and POD is cheating.  Comment char isn't included,
  ##   so don't bother with it.
  $ret-=STR_LENGTH if (substr($s, 0, 1) eq '='); #it's a POD
  $ret-=STR_LENGTH if ($s =~ /^q(.).*?(\1)$/);   #q() string?
  $ret-=STR_LENGTH if ($s =~ /^([`'"]).*(\1)$/); # quoted strings or backticks

  $ret-=15 if ($s =~ /<[^>]{15}>/); # significant portion of code is a file(handle|glob)

  ## Spaces at the start and end
  $ret-=2 if ($s =~ /^ /);
  $ret-=2 if ($s =~ / $/);

  $ret+=2 if ($s =~ /\w \w/); # space separated words tend to make things interesting

  if ($ret < 0) { $ret = 0 };
  return $ret;

Fitness function #2: complexity

At this point I realized that, looking over the first fitness function, I was really looking for complexity in the generated code. A quoted string, for example, receives a low score simply because it is not complex. The next attempt, then, would analyze the number of operations and components of the generated individual.

Enter PPI. PPI is a set of modules that ``Parse, Analyze, and Manipulate Perl (without perl).'' I made use of PPI::Tokenizer to split each individual into tokens. The fitness function then assigned a score based on the number of tokens found. I kept a few of the regexen from the previous function to work around a bug in PPI (It does not, at the time of this writing (2007/06/21), properly handle globs and ?..? rexeges). I also capped the maximum score assigned at a level just below the number of possible tokens, which I found made the results slightly more interesting.

At this point, the GA worked beautifully, and took about thirty seconds to run[4]. After profiling, I replaced the shell call to ``perl -ce'' with a Safe compartment. The code now takes, on average, ten seconds to run on the same machine.

The current script can be found here: perlgen.pl

Results

Samples of the generated results, along with the Deparse'd versions. (In the B::Deparse output, '???'s are constants which have been optimized away by the compiler)

Generated:
x,~&y& *stl;v<gf.*< 

Deparsed:
------------
'???', ~&y & *stl;
'v' < 'gf' . *<;
------------

Generated:
cl;ur r~w^9*\x$u/-ff

Deparsed:
------------
'???';
'r'->ur(~'w' ^ 9 * \$u->x / -'ff');
------------

m9l;o*bd3 +w;f+5+x+k

Deparsed:
------------
'???';
'o' * 'bd3' + 'w';
'f' + 5 + 'x' + 'k';
------------

[1] Consult the great wiki in the sky: Genetic Algorithms.
[2] The nice thing about having an undefined goal is the ability to drastically change the scope of the project.
[3] Google for ``only perl can parse Perl'' to find out why this never works.
[4] On a 2130 Mhz processor, with 512MB RAM.

back contact