In this post I model the reference sequence U00096.3 of Escherichia coli strain K-12 using Markov models and simple neural networks. The goal is to try to predict the nucleotide at a given position of the genome, given some number of preceding nucleotides as context. Let’s start with some basic statistics. There are 4.6 million […]
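To make the prediction task concrete, here is a minimal sketch of an order-k Markov model over a nucleotide string. It is an illustration under assumptions, not the post's actual code: the function names `train_markov`, `predict_next` and `accuracy` are hypothetical, the toy string stands in for the real genome, and the model simply predicts the most frequent successor of each length-k context.

```python
from collections import Counter, defaultdict


def train_markov(seq, k):
    """Count which nucleotide follows each length-k context in seq."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq)):
        counts[seq[i - k:i]][seq[i]] += 1
    return counts


def predict_next(counts, context):
    """Return the most frequent successor of context, or None if unseen."""
    followers = counts.get(context)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def accuracy(counts, seq, k):
    """Fraction of positions in seq whose nucleotide is predicted correctly."""
    total = len(seq) - k
    if total <= 0:
        return 0.0
    hits = sum(
        predict_next(counts, seq[i - k:i]) == seq[i]
        for i in range(k, len(seq))
    )
    return hits / total


# Toy sequence for illustration only; it is not taken from U00096.3.
toy = "ACGTACGTACGAACGT"
model = train_markov(toy, k=2)
print(predict_next(model, "AC"))
print(accuracy(model, toy, k=2))
```

Note that the accuracy here is measured on the same string the counts were collected from, so it is an optimistic figure; a held-out part of the sequence would give a fairer estimate.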
Visualizing k-mer statistics of bacterial genomes
Let us start with a brief recap of the biology of gene expression. A genome is a string of the nucleotides A, C, G and T. This string is the source code for the proteins that the cell can produce. Proteins are strings of amino acids, where the amino acids are selected from a set of 20 naturally occurring […]