Lessons from studying Go with AI

I used to play a lot of Go around 2010-2016. I am not a very strong player, however: my European rating is 2 kyu, which corresponds to a fairly average club player in Finland.

I remember that studying Go used to be difficult because there were no good computer programs to learn from. If you wanted to know how to improve your game, you had to ask a stronger player for advice. However, these stronger players were not always available, and sometimes they gave conflicting advice.

Then, in 2016, AlphaGo happened. It was the first AI to reach and surpass human professional playing strength. Unfortunately, the program was not publicly available, and even if it had been, it required an expensive cluster of 48 tensor processing units to run, putting it out of reach of the average Go player. Around this time my interest in Go faded.

Now, in 2022, there are open source projects that replicate and improve upon the architecture of AlphaGo. The neural networks have been made more efficient, so it is now possible to run a professional-strength AI on a home PC. There are also graphical user interfaces that help with exploring the game tree variations calculated by the AI. Could these new tools help me improve my game?

Lessons

I have been playing Go again on the KGS Go server for the past month. Most of my games are on an account that is firmly stuck at 1 kyu (KGS 1 kyu is roughly equivalent to European 2 kyu). I also have another account at 1 dan, which I use only when I feel that I’m focused and playing well.

After each game, I have loaded the game into KaTrain, a graphical user interface for the KataGo AI. The AI runs on my Nvidia RTX 2070 GPU. In this post I write about the things I have learned from studying with the AI. If you’re not a Go player, the rest of the post will be hard to follow. To learn Go, try this guide, which I myself originally used to learn the rules.

Lesson 1: The opening does not matter

When studying Go with stronger players at the club, we would often have long discussions about opening strategy: which corner approach to use, which joseki to choose, which side extension is best. After looking at many openings with the AI sensei, I’ve come to the conclusion that the opening does not really matter. All moves that follow standard opening principles are usually OK. One move may be a point better than another, but in the grand scheme of things, at the 2 kyu level, that does not matter much.

Using video game terminology, the purpose of the opening is actually to set up the map of the game. In games like StarCraft there are only a few maps available for ranked play. In Go, you and your opponent build a new map for every new game. If you want an influence-oriented map, you can play a lot of stones on the fourth line. If you want a territory-oriented map, you can try the 3-3 point. It’s a blessing that all of these choices are more-or-less valid ways to play, at least in average amateur play.

Move 8 of an opening: white to play. The numbers on the moves are the expected point loss compared to optimal play. Almost all moves that are on the third line or above are playable. All normal corner approaches are good.

There are a lot of non-josekis that are only slightly worse than the best sequence. You don’t have to know the most accurate variants. Often you can play something almost as good and lose only 1-2 points. This loss is negligible compared to losses in the middlegame.

Lesson 2: The endgame does not matter

In a game between two equally matched opponents, the score rarely moves more than 8 points in total during the endgame. This is not because both players play perfectly, but because they play at roughly similar accuracy and the mistakes cancel out. During the middlegame, on the other hand, the score often swings wildly from -30 to +30 points or even more. A single bad move in the middlegame can undo everything that you can realistically gain with careful endgame play. Thus, studying and optimizing the endgame is not very important at my level.

Lesson 3: The actual game happens during the middlegame

The middlegame is where games are won or lost. It feels like every other move presents a tough choice. For example, is it possible to hane here? Which one of my groups needs the most help? Is it possible to tenuki yet? Is this peep sente? Is it possible to invade? Does this cut work? Et cetera.

A very significant number of games are decided by one of the players misreading a life-and-death problem. Even if no group dies, playing an extra move on a group that does not need it can cost more than 10 points, because such a move is sometimes not much better than passing. And even when it does not outright win you the game, life and death often determines things like whether a move is sente against a group, which is very important to know.

A lot of games are decided in a big capturing race. Developing a good intuition for whether a capturing race is favorable is very important. Once the race has started, knowing the right techniques and tesujis to minimize the opponent’s liberties and maximize your own will win you games.

Judged by the AI, the average KGS 1 dan player is actually pretty bad at life and death and capturing races. It seems to me that if you were simply good at these things, you could easily be 3 dan or better even if you didn’t know much about opening theory or the endgame.

Lesson 4: Shape is important

A lot of Go playing strength is just having a mental database of strengths and weaknesses of particular shapes. You should have a strong prior toward playing good shapes, and deviate only when you have a very specific explainable reason against it, like having to cover a cutting point.

Many times during a game I knowingly played a bad shape because I thought the situation was an exception. It usually wasn’t. The lesson is that I would probably do better if I simply never played certain bad shapes, like this one:

Playing one line above or below the circled move is almost always better.

Final thoughts

I feel like the AI has helped me a lot in focusing my efforts on things that actually matter in my games. I try to ignore small mistakes that cost less than 1.5 points, and instead focus on the larger mistakes and try to figure out why they are mistakes. Usually it is quite obvious in retrospect.

It seems that most of my losses currently are due to bad tactical judgement and misreads. I still mess up life-and-death problems that a 10 kyu could solve. On the bright side, my opponents at KGS 1 dan are making similarly bad mistakes, so it seems at least feasible to reach KGS 2 dan or stronger with practice. There are no mysterious dark arts to learn to play at that level – just play solid and don’t screw up so much. Easier said than done, huh?

Oura ring first impressions

My Oura ring arrived in the mail today. In this post I write about my first impressions with the ring. For those unfamiliar with Oura, it’s a piece of wearable electronics that costs about 300 euros and tracks things like your heart rate, body temperature and steps walked. You can view your statistics by using the Oura app on your phone, which receives data from the ring via Bluetooth. It’s a very neat idea, but there are some things that definitely rub me the wrong way.

The subscription model

Unfortunately, the app holds most of your data hostage unless you pay a bullshit 6€/month subscription fee. This is very frustrating: the data is right there, but you are prevented from accessing it. I still bought the ring because it seemed to be the best product in this category on the market at the moment.

I get that the company needs money, but I wish they made it some other way, such as by offering additional data processing or recommendations. Or they could take the mobile phone company route and release a new model every year with some shiny new features to get people to buy new hardware. With the current model, you don’t even get to see your heart rate if you don’t pay.

Not cool, Oura, not cool. I’m semi-seriously considering trying to reverse engineer the Bluetooth communications to get access to my data without the subscription. I wonder if they have put some kind of encryption in place to prevent this. It might be that the data flows through their servers, which decrypt it for the app.

I also feel somewhat uncomfortable that my health data is apparently stored on their servers. Why do they need to store my data? My phone has plenty of storage space. Is the data properly anonymized? I think not, because it’s linked to my account. What if there is a data breach? I would feel safer if the data never left my phone.

The company probably wants to run analysis on the customer data to improve their product. That is reasonable, but I wish there was an option to opt out.

Show me the algorithms

Oura runs some analysis on your data and summarises it into a few key metrics: the readiness score, the sleep score and the activity score. As a data nerd, I would be interested in the methods and formulae used to compute these scores. I would like to see a paper that describes them in detail, but the official documentation is quite vague. I realise the average user is probably not a bioinformatician like me, and probably not so interested in the technical details.

I am not really interested in simplified aggregate scores. I am more interested in the raw data, and I would like to do the interpretation myself. Fortunately, Oura does let you download your data in CSV format, provided that you pay the subscription fee, and I’m looking forward to doing that after I’ve used the ring for a few weeks.

All that being said, the ring and app do seem to work. When I went for a run today, it correctly recognized the activity and estimated the calories burnt. I did not get a heart rate during the workout, because it turned out that I needed to enable that separately. The app also works as a fitness tracker where you can log your workouts.

There are still many features which I have not explored, such as the sleep tracker. I might write a follow-up post later down the line. My first impressions are okay, even though the subscription model is annoying to me.

Intuition on Polynomial Running Times

The quadratic function $f(n) = n^2$ is a common growth rate in computer science. How does it grow, intuitively? I’ve found it useful to think about the function in terms of what happens when you double the size of the input $n$. In the case of the quadratic function, doubling the input size always multiplies the running time by a factor of four. The proof is short: $f(2n) / f(n) = (2n)^2 / n^2 = 4n^2 / n^2 = 4$. Generalizing, multiplying the input by a factor of $k$ multiplies the running time by a factor of $k^2$.

This property continues to hold even if there is a constant factor $a$ so that the function has the form $f(n) = an^2$. We may also want to consider lower-order terms, giving a function of the form $f(n) = an^2 + bn + c$. The above property still holds when $n$ is large enough that the lower-order terms $bn + c$ become insignificantly small compared to the term $an^2$. More formally: $$\lim_{n \rightarrow \infty} \frac{f(kn)}{f(n)} = \lim_{n \rightarrow \infty} \frac{a \cdot (kn)^2 + b \cdot (kn) + c}{an^2 + bn + c} = k^2 \; .$$

I find it useful that this property eventually holds for any quadratic, irrespective of the values of $a$, $b$ and $c$. How to use this result in practice? Suppose you have run an algorithm with a quadratic running time on a test set that is 5% of your whole input, and it took 1 hour. How long will the whole data set take? The whole input is $20$ times larger, so it will take $20^2 = 400$ times as long, that is, 400 hours, provided the input is large enough that the lower-order terms do not matter.
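As a minimal sketch, here is the same extrapolation in Python (the function and its name are mine, for illustration):

```python
def extrapolate_quadratic(sample_time, scale_factor):
    """Predict the running time on the full input from a timing on a
    smaller sample, assuming a quadratic algorithm and inputs large
    enough that lower-order terms are negligible."""
    return sample_time * scale_factor ** 2

# 1 hour on a 5% sample; the full input is 20 times larger.
print(extrapolate_quadratic(1.0, 20))  # 400.0 hours
```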

Past quadratic

Ok then, what about the cubic function $f(n) = n^3$? Now multiplying the input by $k$ multiplies the running time by $k^3$. More generally, if the function is of the form $f(n) = n^d$, then multiplying the input by a factor $k$ multiplies the running time by a factor $k^d$.

Suppose we are benchmarking some algorithm that we know has a polynomial running time, but we do not know of which degree. To figure out the degree of the polynomial, we could try to fit different polynomial curves to the running times, but there is an easier way. We can run the algorithm on two large inputs of sizes $n$ and $2n$, and infer the polynomial degree from the property above. For example, if the running time increases by a factor of 8, we know the function is cubic. More generally, if the running time increases by a factor of $r$, then the degree of the polynomial is $\log_2 r$.
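Here is a sketch of this experiment in Python; the timing harness and the toy cubic workload are mine, for illustration:

```python
import math
import time

def estimate_degree(algorithm, n):
    """Estimate the degree of a polynomial-time algorithm by timing it
    on inputs of sizes n and 2n. Assumes n is large enough that
    lower-order terms are negligible."""
    start = time.perf_counter()
    algorithm(n)
    t1 = time.perf_counter() - start

    start = time.perf_counter()
    algorithm(2 * n)
    t2 = time.perf_counter() - start

    return math.log2(t2 / t1)

def cubic_workload(n):
    # A deliberately Theta(n^3) toy algorithm.
    total = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                total += 1
    return total

print(estimate_degree(cubic_workload, 200))  # approximately 3.0
```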

Predicting nucleotides in the E. coli genome

In this post I model the reference sequence U00096.3 of Escherichia coli strain K-12 using Markov models and simple neural networks. The goal is to try to predict the nucleotide at some position of the genome, given some number of preceding nucleotides as context.

Let’s start with some basic statistics. There are 4.6 million nucleotides in the genome. All four nucleotides are approximately equally common:

A       C       G       T
24.62%  25.42%  25.37%  24.59%
Distribution of nucleotides in E. coli reference sequence U00096.3

Guessing nucleotides uniformly at random gives an expected prediction accuracy of 25%. We can improve this to 25.42% by always guessing the most common nucleotide, C. One way to try to improve further is to use a Markov model.
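A minimal sketch of this baseline, assuming the genome has already been loaded into a Python string over the alphabet ACGT:

```python
from collections import Counter

def nucleotide_frequencies(genome):
    """Relative frequency of each nucleotide in the genome string."""
    counts = Counter(genome)
    total = sum(counts.values())
    return {base: counts[base] / total for base in "ACGT"}

def baseline_accuracy(genome):
    # Accuracy of always guessing the most common nucleotide.
    return max(nucleotide_frequencies(genome).values())
```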

Markov models

In our setup, a Markov model of order k is a model that predicts the next nucleotide in a sequence based only on the k previous characters of the sequence. The k previous characters are called the context. The model learns a probability distribution for the next nucleotide for each of the 4^k possible distinct contexts. We can take the most likely nucleotide in the distribution as the prediction of the model.

For example, a Markov model of order 2 would have 16 contexts: AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG and TT. Each of these would be associated with a probability distribution over A, C, G and T. For example, for the context TG, we could have P(A | TG) = 0.2, P(C | TG) = 0.1, P(G | TG) = 0.5 and P(T | TG) = 0.2. This model would predict that if the context is TG, the next nucleotide is most likely G.

The distributions for each context can be easily estimated from the reference genome: we simply find all occurrences of a context and tally which characters follow it in the data. If a context does not occur at all in the training data, we assume a uniform distribution.

To evaluate how well this works, I build a training dataset and a test dataset: 50% of the positions in the reference genome go to the training set, and the rest to the test set. The distributions are estimated from the positions in the training set, and the model then tries to predict the nucleotides at all positions of the test set. I repeat this experiment for all Markov model orders from 0 to 16.
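A sketch of the whole experiment, again assuming the genome is available as a string. The random 50/50 split of positions and the fixed fallback guess for unseen contexts are my simplifications (the post proper uses a uniform distribution for unseen contexts):

```python
import random
from collections import Counter, defaultdict

def train_markov(genome, order, train_positions):
    """For each order-k context, count which characters follow it at the
    training positions, and keep the most likely one as the prediction."""
    counts = defaultdict(Counter)
    for i in train_positions:
        counts[genome[i - order:i]][genome[i]] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def accuracy(genome, order, model, test_positions):
    # Under a uniform distribution every guess for an unseen context is
    # equally good; here we simply fall back to a fixed nucleotide.
    correct = sum(model.get(genome[i - order:i], "A") == genome[i]
                  for i in test_positions)
    return correct / len(test_positions)

max_order = 16
positions = list(range(max_order, len(genome)))
random.shuffle(positions)
half = len(positions) // 2
train, test = positions[:half], positions[half:]

for order in range(max_order + 1):
    model = train_markov(genome, order, train)
    print(order, accuracy(genome, order, model, test))
```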

The best model turns out to have order 6, with an accuracy slightly above 34%. Models with a smaller order are too coarse (underfitting), whereas models with a larger order are too specific (overfitting).

Neural networks

If we increase the order of the Markov model too much, the contexts become so specific that almost all of them occur only once in the genome, and hence we cannot estimate a reasonable distribution for the next nucleotide. However, there could be longer-range dependencies in the nucleotide sequence than a low-order Markov model can exploit.

Neural networks are one possible way to try to exploit longer-range dependencies. The input to the network can be a context of any length, and the output will be a distribution for the next nucleotide.

I tried three simple architectures. All three take as input a one-hot encoded representation of the context, and output a distribution over the four nucleotides using a final softmax layer of size 4. The first and second architectures have a single hidden layer with 10 and 100 neurons, respectively. The third architecture has three hidden layers of 100 neurons each. All layers are fully connected. I trained the models in Tensorflow 2.0 with context lengths 1, 2, 4, 8, 16, 32, 64 and 128, with one full pass through the training data, using the default Adam optimizer and the categorical cross-entropy loss function. The final accuracies on the test set are plotted below. Each line corresponds to one hidden layer architecture. The red horizontal line shows the accuracy of the best Markov model.
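The post does not include the model code, but a network along these lines can be set up in Keras roughly as follows; the ReLU activation for the hidden layers is my assumption:

```python
import tensorflow as tf

def make_model(context_length, hidden_layers):
    """One-hot encoded context in, softmax over A, C, G, T out.
    hidden_layers is e.g. [10], [100] or [100, 100, 100]."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.InputLayer(input_shape=(context_length * 4,)))
    for width in hidden_layers:
        model.add(tf.keras.layers.Dense(width, activation="relu"))
    model.add(tf.keras.layers.Dense(4, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = make_model(context_length=8, hidden_layers=[100])
# Train with model.fit(X_train, y_train, epochs=1), where X_train holds
# one-hot contexts and y_train one-hot next-nucleotide labels.
```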

Visualizing k-mer statistics of bacterial genomes

Let us start with a brief recap of the biology of gene expression. A genome is a string of nucleotides A, C, G and T. This string is the source code for the proteins that the cell can produce. Proteins are strings of amino acids, selected from a set of 20 naturally occurring types. A string of nucleotides A, C, G and T is translated into a string of amino acids by interpreting consecutive non-overlapping triplets (3-mers) of nucleotides as codewords (codons) for the different kinds of amino acids. Since there are 64 possible distinct triplets of nucleotides and only 20 distinct amino acids, most amino acids are encoded by multiple different codons. Two codons that encode the same amino acid are called synonymous.
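To make the triplet interpretation concrete, here is a small sketch (the function and the example sequence are mine):

```python
def codons(coding_sequence):
    """Split a coding sequence into consecutive non-overlapping triplets."""
    return [coding_sequence[i:i + 3]
            for i in range(0, len(coding_sequence) - 2, 3)]

print(codons("ATGGCTGAATAA"))  # ['ATG', 'GCT', 'GAA', 'TAA']
```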

It is known that bacterial species tend to favor some synonymous DNA codons over others. This phenomenon is called codon bias. The reason for these biases is a debated topic in molecular evolution. One explanation is that differences in the transfer RNA (tRNA) pool of the cell lead to discrepancies in the translation efficiency of synonymous codons, which leads to mutational pressure toward the more efficient codons.

It turns out that every species has a unique characteristic codon bias. This suggests that the molecular pathway from DNA to protein favors different codons in different bacteria. It is not clear whether codon bias evolves to adapt to the tRNA pool and other conditions in the cell, or vice versa.

In this post we examine the codon usage and k-mer statistics of the reference genomes of Escherichia coli and Vibrio cholerae.

3-mer distributions

First, let’s just plot the distribution of nucleotide triplets in E. coli.

There is a pattern in this distribution: the frequency of a 3-mer is strongly correlated with the frequency of its reverse complement. This can be seen clearly when we plot the frequency of each 3-mer against the frequency of its reverse complement in a scatterplot.

This can be explained by the double-stranded structure of DNA. Genes are encoded on both complementary strands of the DNA, and one strand is the reverse complement of the other. The reference genome is just one arbitrarily chosen strand of the DNA, and genes from the other strand appear reversed and complemented on the reference strand. The same selective pressures on codons apply to the reverse complement.
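A sketch of the computation behind such a scatterplot, assuming the genome is available as a string:

```python
from collections import Counter
from itertools import product

import matplotlib.pyplot as plt

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    return kmer.translate(COMPLEMENT)[::-1]

def kmer_frequencies(genome, k):
    """Relative frequency of every possible k-mer over ACGT."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    total = sum(counts.values())
    return {"".join(p): counts["".join(p)] / total
            for p in product("ACGT", repeat=k)}

freqs = kmer_frequencies(genome, 3)
xs = [freqs[kmer] for kmer in freqs]
ys = [freqs[reverse_complement(kmer)] for kmer in freqs]
plt.scatter(xs, ys)
plt.xlabel("3-mer frequency")
plt.ylabel("Frequency of reverse complement")
plt.show()
```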

Let us now check the codon distribution of V. cholerae:

These statistics are quite different from those of E. coli. To see this better, let us draw a scatterplot comparing codon frequencies between E. coli and V. cholerae. Each dot is one codon; the x-coordinate gives the frequency in E. coli, and the y-coordinate the frequency in V. cholerae.

Barcodes

It also turns out that the codon usage statistics remain stable over the whole genome. To visualize this, we use the barcode plot, introduced by Fengfeng Zhou, Victor Olman and Ying Xu. In the method, the genome is divided into non-overlapping segments (“windows”), and the 3-mer distribution of each segment is computed. The data is plotted as a black-and-white image, where each column represents a genome segment, each row represents a 3-mer, and the brightness of a pixel is proportional to the frequency of that 3-mer in that segment.
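A sketch of how such a barcode image can be computed, again assuming the genome string; the window size and k are parameters:

```python
from itertools import product

import matplotlib.pyplot as plt
import numpy as np

def barcode(genome, k, window):
    """One column per window, one row per k-mer; pixel brightness is
    proportional to the k-mer's frequency within that window."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    row = {kmer: i for i, kmer in enumerate(kmers)}
    n_windows = len(genome) // window
    image = np.zeros((len(kmers), n_windows))
    for w in range(n_windows):
        segment = genome[w * window:(w + 1) * window]
        for i in range(len(segment) - k + 1):
            kmer = segment[i:i + k]
            if kmer in row:  # skip ambiguous characters such as N
                image[row[kmer], w] += 1
    return image / window  # normalize counts to frequencies

plt.imshow(barcode(genome, 3, 10000), cmap="gray", aspect="auto")
plt.show()
```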

If the 3-mer composition remains the same in every segment, every row of pixels should have a constant brightness. Running the experiment confirms that this is approximately true. Below is the barcode plot of E. coli with a window size of 10000 nucleotides.

E. coli 3-mers, window size 10000

It’s a bit hard to see because the image is only 64 pixels tall, but it’s evident that there are long horizontal lines. However, there are some breaks in the lines. Perhaps those regions are recently acquired genes which have not yet adapted to the codon usage patterns of their new host? In Vibrio cholerae, we see a longer horizontal break:

V. cholerae 3-mers, window size 10000

Building the barcode plot for quartets of nucleotides instead of triplets shows a similar pattern. Now each row represents a 4-mer and there are 4^4 = 256 rows.

E. coli 4-mers, window size 10000
V. cholerae 4-mers, window size 10000

Decreasing the window size from 10000 to 1000 shows even more detail. Be sure to click the images for full-sized versions.

E. coli 4-mers, window size 1000
V. cholerae 4-mers, window size 1000