Flipping Out

From programming_contest
Jump to navigation Jump to search

This problem gives us a series of coinflips. The series is generated by several fixed generators which are also serieses of coin flips. When generating the series, if the number of total matches of each generator string with the previously generated output is even, the next output is tails, else heads. The problem supposes there is one generator series missing, and asks how many possibilities are there for the missing generator given the other generators and the intended output series.

We can find what the missing sequences would be by simulating the generation of output string from the given generators. When we reach an output that comes out incorrectly with respect to the known generators, we know that the missing generator must terminate at the character we're on. That makes every string that terminates at that character a candidate. We know, however, that none of those candidates should have appeared EARLIER in the output string, since that would have caused some earlier output to be different, and we know it's not.

Next time we come across an incorrect character, we can check against all our candidate sequences for any that match at that index, eliminating any that don't. If we have no candidate that matches all points which are incorrect in the output string, there is no solution (or if the first character is an H...). If we have a situation where there are NO incorrect characters in the output, there are infinite solutions (any generator that doesn't match any substring of the output will do...of which there are infinite choices).

So now the only question is how to do this efficiently.

The naive method walks the input string and at each step compares with all previous strings. At each step in the output string, we have to compare with all of our generator strings to see if there are any new matches. Given an output string of length n, and the fact that the TOTAL number of characters is 10^6, this bounds the total length of the generator strings to 10^6 - n. Further, the time to do the comparison is bounded by the total length of the generator strings. This makes the total time for all comparisons n * (10^6 - n). This is maximized when n = 10^6/2, which is still far too large.

Instead of going character by character, can we instead pre-compute the location of all string matches? Given the same constraints above, this will still be too slow.

What repeat work are we doing?

consider two generator strings

  • TTT
  • TTTH

In all our proposed solutions, we would consider them independent. We can see intuitively, however, that trying to match TTT does most of the work for trying to match TTTH. We just have to figure out a way to encode the generator strings to allow us to take advantage of that entropy. The natural solution is to build a prefix tree.

  1. build a prefix tree of the generator strings
  2. iterate through the output strings, creating a new pointer into the tree for each new character
  3. at each character, walk all the current pointers into the tree, traversing them to the appropriate child node for the current character in the output string
  4. each time we read a new character, update all the pointers

It turns out this is too slow, since we can have n pointers, and n characters.

There is repeat work at play here, as well, since finding

  • HTTT
  • TTT

Will overlap, but the prefix tree doesn't help there.

Fortunately, Aho-Corasick solves this problem in linear time and allows us to find every match. We can then sort the matches, count the number that terminate at any given element, and track the prefix sum to figure out how many terminate earlier than a given index. We can then process the output string in linear time to find the inconsistencies. "All" that's left is to identify the candidates. We know any candidate must adhere to the following three properties:

  1. must be some common substring which terminates at ALL "incorrect" indices
  2. must not occur anywhere ELSE in the tree

Assuming we could find the longest common substring quickly, we can just use the Aho-corasick algorithm again against all the substrings to find the shortest one which doesn't match more than it's supposed to (or even binary search the min-length and KNP search each one if you wish). How to find the maximal length, though? We can effectively do the same thing! Simply take every substring and make sure it matches often ENOUGH (instead of too often).

  1. use aho-corasick to identify all matches of all generator strings
  2. sort the match indices and create a prefix sum
  3. walk the string and identify the incorrect indices
  4. take all length prefixes of the earliest mismatch as candidates
  5. use aho-corasick to identify all matches of the candidates
  6. find the shortest candidate which doesn't match anywhere else in the string and the longest which matches all the mismatches
  7. the difference between the two lengths are the possible solutions