Attacking the greedy superstring conjecture with AI

In 1988, Finnish computer scientists Esko Ukkonen and Jorma Tarhio conjectured that the simple greedy algorithm for shortest common superstring has a worst-case approximation ratio of 2. Amazingly, the problem is still open almost 40 years later, despite substantial efforts and partial results by many authors (see e.g. https://arxiv.org/abs/2407.20422v1 for a survey of what is currently known about the problem).

Problem statement

The problem is as follows. A string $A$ is considered a substring of string $B$ if there is an occurrence of $A$ inside $B$. For example, the string “art” is a substring of “cartoon”. A superstring is the opposite of a substring: “cartoon” is a superstring of “art”.

The shortest common superstring problem asks: given a set of input strings $A_1, \ldots, A_m$, what is the length of a shortest possible string that is a superstring of each of the input strings? For example, for the input strings “rhino”, “tar” and “magenta”, a shortest superstring is “magentarhino”, of length 12. We operate under the assumption that no input string is a substring of another; otherwise we could remove the substring from the input without changing the answer.

The greedy algorithm for the problem is as follows: repeatedly merge an ordered pair of strings with the longest overlap (breaking ties arbitrarily), until just one string remains. For example, the overlap of “magenta” and “tar” is 2 because we can combine the two words into a single word “magentar” by sharing two letters. On this input, the greedy algorithm runs as follows:

“rhino”, “tar”, “magenta”
“rhino”, “magentar”
“magentarhino”
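In code, one run of this procedure can be sketched as follows. This is a minimal illustration of the merge loop described above (my own sketch, not code from the chat); it breaks ties by scan order and drops any string that ends up entirely inside a merge.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(strings: list[str]) -> str:
    """One run of greedy: repeatedly merge the ordered pair with the
    longest overlap (ties broken by scan order) until one string remains."""
    strings = list(strings)
    while len(strings) > 1:
        # Pick the ordered pair (i, j) with the maximum overlap.
        k, i, j = max((overlap(a, b), i, j)
                      for i, a in enumerate(strings)
                      for j, b in enumerate(strings) if i != j)
        merged = strings[i] + strings[j][k:]
        rest = [s for t, s in enumerate(strings) if t not in (i, j)]
        # Drop anything that ended up entirely inside the merge.
        strings = [s for s in rest if s not in merged] + [merged]
    return strings[0]
```

On the example above, greedy_superstring(["rhino", "tar", "magenta"]) first merges “magenta” and “tar” into “magentar” and then produces “magentarhino”.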

In this case, the algorithm happens to produce the optimal superstring. Sometimes it does worse. The approximation ratio of the algorithm is defined as $|G| / |S|$, where $|G|$ is the length of the string produced by the algorithm, and $|S|$ is the length of an optimally short superstring. The conjecture is that $|G| / |S| \leq 2$ for any run of the algorithm with any tie-breaking method between overlaps of equal length.

In their 1988 paper, Ukkonen and Tarhio show that the worst-case ratio can exceed $2 - \varepsilon$ for any $\varepsilon > 0$, so proving that $|G| / |S| \leq 2$ would complete our understanding of the worst-case behaviour of the algorithm, with matching upper and lower bounds.

This might be within reach of AI

It feels to me that this could be one of those statements whose proof boils down to a massive case analysis, like the proof of the four-color theorem. But unlike the four-color theorem, this problem is far less well known. If we put a swarm of 10,000 talented computer scientists on it full time, maybe it would be solved in short order, but the problem is not important enough for that.

But wait, we have math-capable AI now! If it is indeed just tedious case analysis and grinding of formulas, maybe it’s just the right kind of problem for AI. In my experience, the latest thinking AI models are remarkably good at mining for patterns in discrete problems like this. For example, they will automatically write code to enumerate all small instances, make conjectures from that, and try to verify or refute their conjectures with more code. They are also very good at proving elementary and even some not-so-elementary lemmas.

So I let ChatGPT5.2-Thinking loose on the problem. I managed to get it to prove the result for inputs containing up to four strings, but it could not do it for five strings or more. Here’s how it went.

Greedy superstring for a small number of input strings

First I straight-up asked the AI (ChatGPT5.2-Thinking extended) to prove the conjecture. It was hesitant to put in a serious effort, saying that the problem has been open for a long time, and instead started to give me a survey on what is known about it. It took some encouraging to get it to work for real. Here’s how I prompted it:

I know it’s open. But you’re an extremely capable AI with generic problem solving skills. I want you to explore the problem, come up with conjectures, test them, and prove them. Similar to how you would approach a project Euler problem. I know you have the capability. This is not some huge open problem like P = NP. This is well within reach of existing methods. Go!

Now it actually gave me something interesting. It pointed out that one of the formulas in Ukkonen and Tarhio’s original 1988 paper immediately implies the result for the special case when the input consists of exactly three strings. Indeed, let $n$ be the total length of the input strings, $k$ be the length of the output of the greedy algorithm, and $s$ be the length of an optimal superstring. Ukkonen and Tarhio show that $k \leq (n+s)/2$. We also know that any superstring is at least as long as the longest input string, which is at least as long as the average input string, so $s \geq n/3$, i.e. $n \leq 3s$, and combining this with the other formula gives $k \leq (3s+s)/2 = 2s$. Does this generalize to four input strings? Unfortunately no, because then we no longer have $s \geq n/3$ but only $s \geq n/4$, which merely gives $k \leq (4s+s)/2 = \frac{5}{2}s$.

It then tried to find an argument for four strings, but failed to find one within 10 minutes of thinking. It did, however, run some brute-force checking to verify that the conjecture holds for small four-string inputs. I pushed it to do more experiments. It worked for 25 minutes and concluded that the worst-case inputs appear to have lots of overlaps and many equally good options for merging. Then I asked it to try to extend the proof for three input strings to four strings. It thought for 20 minutes and told me that the proof does not generalize, but that it could try classifying all overlap configurations in an attempt to find a different proof. I told it to go ahead; after 20 minutes it had broken the problem into subcases and asked me whether I wanted it to dive into one of them. I told it to go ahead, and it ground for another 17 minutes through a very messy-looking case analysis, still without a proof. After some more pushing, to my surprise, it finally found a working argument! It told me that it had solved one subcase, but then realized that the same reasoning applied to all cases and in fact proved the whole result for four strings. The proof uses only elementary reasoning and arithmetic, and is easy to verify. Here’s how it goes:
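A check along the lines of what the model ran can be reproduced in a few lines: enumerate small four-string instances, take the worst greedy outcome over all tie-breaking choices, and compare it against the optimum over all merge orders (valid because no input contains another). The restriction to length-3 binary strings is my own choice to keep the search tiny; it is a sanity check on small cases, nothing more.

```python
from itertools import combinations, permutations, product

def overlap(a, b):
    """Longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def optimal_length(strings):
    """Shortest superstring length: best merge order with maximal
    consecutive overlaps (correct when no string contains another)."""
    return min(
        len(p[0]) + sum(len(b) - overlap(a, b) for a, b in zip(p, p[1:]))
        for p in permutations(strings))

def greedy_worst(strings):
    """Longest final length over all tie-breaking choices of greedy."""
    if len(strings) == 1:
        return len(strings[0])
    pairs = [(overlap(a, b), i, j) for i, a in enumerate(strings)
             for j, b in enumerate(strings) if i != j]
    m = max(k for k, _, _ in pairs)
    worst = 0
    for k, i, j in pairs:
        if k == m:  # explore every maximal merge, i.e. every tie-break
            merged = strings[i] + strings[j][k:]
            rest = [s for t, s in enumerate(strings) if t not in (i, j)]
            worst = max(worst, greedy_worst(rest + [merged]))
    return worst

def max_ratio_four_strings():
    """Worst greedy/optimal ratio over all 4-sets of length-3 binary strings."""
    pool = ["".join(p) for p in product("ab", repeat=3)]
    return max(greedy_worst(list(quad)) / optimal_length(quad)
               for quad in combinations(pool, 4))
```

Distinct strings of the same length cannot contain one another, so the no-substring assumption holds for every enumerated instance.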

Suppose the input strings are $A,B,C$ and $D$ with total length $n$. Let ov($A,B$) denote the overlap between $A$ and $B$ (considering both orders, putting $A$ before $B$ and $B$ before $A$, and taking the longer overlap), and let $s$ denote the length of an optimal superstring.

Lemma 1: For any two input strings $x, y$, we have ov($x,y$) $\geq \max(0, |x| + |y| - s)$.
Proof: Consider the leftmost occurrences of $x$ and $y$ in a shortest superstring. Let $I_x$ be the interval of indices where $x$ is, and $I_y$ the interval of indices where $y$ is. Since no input string is a substring of another, the intervals do not nest. We have ov($x, y$) $\geq |I_x \cap I_y| = |x| + |y| - |I_x \cup I_y| \geq |x| + |y| - s$.
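Lemma 1 is easy to sanity-check by machine. The sketch below (my own harness, not from the chat) plants two non-nested substrings $x, y$ in a random host string $S$ and verifies ov($x,y$) $\geq |x| + |y| - |S|$; since the optimal superstring is no longer than any particular superstring, this is a weaker consequence of the lemma, with $s$ replaced by $|S|$.

```python
import random

def overlap_dir(a, b):
    """Longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def ov(x, y):
    """Overlap in the lemma's sense: the better of the two orders."""
    return max(overlap_dir(x, y), overlap_dir(y, x))

def check_lemma1(trials=1000, seed=0):
    rng = random.Random(seed)
    checked = 0
    while checked < trials:
        S = "".join(rng.choice("ab") for _ in range(rng.randint(4, 12)))
        # Take two random substrings of S.
        i1 = rng.randrange(len(S)); j1 = rng.randrange(i1 + 1, len(S) + 1)
        i2 = rng.randrange(len(S)); j2 = rng.randrange(i2 + 1, len(S) + 1)
        x, y = S[i1:j1], S[i2:j2]
        # The lemma assumes neither string is a substring of the other.
        if x in y or y in x:
            continue
        assert ov(x, y) >= len(x) + len(y) - len(S)
        checked += 1
    return True
```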

There will be three merges in the algorithm. Denote the lengths of the overlaps in the merges by $m_1, m_2$ and $m_3$. Now, suppose $A,B$ is the first pair of strings that is merged by the algorithm. Then $m_1 =$ ov($A,B$). Since after the first merge the pair $C,D$ is still present, the overlap length of the second merge satisfies $m_2 \geq$ ov($C,D$). The final length is $k = n - m_1 - m_2 - m_3 \leq n - m_1 - m_2 \leq n -$ ov($A,B$) $-$ ov($C,D$), and applying Lemma 1 gives $k \leq n - (|A| + |B| - s) - (|C| + |D| - s) = n - (|A| + |B| + |C| + |D|) + 2s = 2s$. This proves the conjecture for four strings!
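This chain of inequalities can be spot-checked numerically. The sketch below (my own harness) runs one greedy pass on random four-string instances, records which original pair merged first, and confirms $k \leq n -$ ov($A,B$) $-$ ov($C,D$) $\leq 2s$, with the optimum computed by trying all merge orders.

```python
import random
from itertools import permutations

def overlap_dir(a, b):
    """Longest suffix of a that is a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def ov(x, y):
    return max(overlap_dir(x, y), overlap_dir(y, x))

def optimal_length(strings):
    """Optimum over merge orders (valid: no string contains another)."""
    return min(
        len(p[0]) + sum(len(b) - overlap_dir(a, b) for a, b in zip(p, p[1:]))
        for p in permutations(strings))

def greedy_run(strings):
    """One greedy run; returns (final length, indices of first-merged pair)."""
    items = [(s, frozenset([i])) for i, s in enumerate(strings)]
    first_pair = None
    while len(items) > 1:
        k, i, j = max((overlap_dir(a, b), i, j)
                      for i, (a, _) in enumerate(items)
                      for j, (b, _) in enumerate(items) if i != j)
        if first_pair is None:
            first_pair = items[i][1] | items[j][1]
        merged = (items[i][0] + items[j][0][k:], items[i][1] | items[j][1])
        items = [it for t, it in enumerate(items) if t not in (i, j)] + [merged]
    return len(items[0][0]), first_pair

def check_four_string_chain(trials=200, seed=0):
    rng = random.Random(seed)
    done = 0
    while done < trials:
        strs = ["".join(rng.choice("ab") for _ in range(rng.randint(3, 6)))
                for _ in range(4)]
        # Enforce the no-substring assumption on the input.
        if len(set(strs)) < 4 or any(x != y and x in y for x in strs for y in strs):
            continue
        n = sum(map(len, strs))
        s = optimal_length(strs)
        k, first = greedy_run(strs)
        a, b = sorted(first)
        c, d = [t for t in range(4) if t not in first]
        bound = n - ov(strs[a], strs[b]) - ov(strs[c], strs[d])
        assert k <= bound <= 2 * s
        done += 1
    return True
```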

I have not seen this four-string result in the literature, and ChatGPT cannot find it either. Maybe people just thought it was obvious and didn’t bother to write it down. But I think it is interesting because it pushes the boundary of how many strings we can reason about at once. The next obvious question is: what about five strings? Unfortunately, the argument for four strings does not generalize. Even after a few hours of pushing, the AI failed to prove the result for five strings.

I find this curious because having just five strings still feels like a finite setup with finite complexity. But one cannot easily enumerate and check “all cases” since the strings can be arbitrarily long and thus the number of possible input instances is infinite. A way to attack the problem could be to classify the input instances into a finite number of classes according to the overlap structure, and prove the result for each class separately. Five strings sure sounds a lot more tractable than an unbounded number of strings.

Therefore, I propose the 5-string greedy superstring conjecture: the greedy superstring algorithm has a worst-case approximation ratio of at most 2 on inputs with exactly five strings. Would a proof of this five-string conjecture already have enough structure to generalize to the full conjecture for any number of strings? Or if not, how far can we push this? Six strings? Seven?


The chat log is here: https://chatgpt.com/share/696ead4d-f138-8009-b705-797d03c540b5