Introduction

Tandem, dispersed, and higher-order repeats have previously been studied using restriction enzyme analysis. However, the scientific literature does not yet provide a direct computational method for identifying and analyzing higher-order repeats in GenBank sequence data. Given the fast growth of sequence data for centromeric regions, efficient tools for such computational analysis are of increasing interest.

The key-string algorithm (KSA) "cuts" a given genomic sequence into HORs and can be regarded as a computational simulation of restriction enzymes. The key string is chosen as a convenient combination of nucleotides. We also found that alpha satellites, whether or not they are organized into HORs, contain certain fixed strings of nucleotides that are robust with respect to mutations, for example CAAA, GTTT, TTTC, and TTTT. Such strings are therefore convenient key strings for detecting alpha satellites in the centromere of every chromosome, simultaneously revealing HORs where they are present. The alignment of alpha satellites from the same HOR reveals the consensus length of the HORs.
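The cutting step can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names key_string_fragments and fragment_lengths are hypothetical, and the sketch assumes that the cut is placed immediately before each occurrence of the key string and that occurrences are scanned without overlap. The toy sequence is invented for illustration and is not real centromeric data.

```python
def key_string_fragments(sequence, key_string):
    """Cut `sequence` immediately before each occurrence of `key_string`.

    Assumption (not from the paper): occurrences are scanned left to
    right without overlap, and the key string starts each new fragment.
    """
    if not key_string:
        raise ValueError("key_string must be non-empty")

    # Collect the start position of every non-overlapping occurrence.
    cut_positions = []
    start = sequence.find(key_string)
    while start != -1:
        cut_positions.append(start)
        start = sequence.find(key_string, start + len(key_string))

    # Fragment boundaries: sequence start, every cut, sequence end.
    boundaries = [0] + [p for p in cut_positions if p != 0] + [len(sequence)]
    return [sequence[a:b] for a, b in zip(boundaries, boundaries[1:]) if b > a]


def fragment_lengths(sequence, key_string):
    """Return the lengths of the fragments, in order along the sequence."""
    return [len(f) for f in key_string_fragments(sequence, key_string)]


# Toy example: a 2-nucleotide prefix followed by three copies of a 9-mer.
toy_sequence = "AC" + "GTTTAAACC" * 3
print(fragment_lengths(toy_sequence, "GTTT"))  # [2, 9, 9, 9]
```

In this toy case the repeated fragment length of 9 reflects the period of the artificial repeat. On real sequences, a recurring fragment length in such an array would only be a starting point for the analysis described above.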

We develop several methods for the fast computational identification and analysis of higher-order repeats (HORs) in a given genomic sequence, without requiring a priori information on the composition of the sequence. These methods are extensions of the key-string algorithm (KSA).