Reference Sequence Formats

This appendix explains how breseq handles different reference sequence formats: where the properties of different annotations are read from, how they are classified internally, and how they may be converted.

Illegal Characters

For all sequence formats:

  1. In nucleotide sequences, all characters are converted to uppercase and all non [ATCG] characters are converted to [N].
  2. In gene names and locus tags, the characters [,;/|] are replaced with [_].
  3. In gene descriptions, the character [|] is replaced with [;].
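The three rules above can be sketched in a few lines of Python. This is only an illustration of the substitutions as described here, not breseq's own code. The function names are hypothetical, and the sketch assumes the sequence string has already had line breaks and whitespace removed.

```python
import re


def sanitize_sequence(seq: str) -> str:
    """Rule 1: uppercase everything, then turn any character
    other than A, T, C, or G into N."""
    return re.sub(r"[^ATCG]", "N", seq.upper())


def sanitize_name(name: str) -> str:
    """Rule 2: replace each of , ; / | in a gene name or locus tag with _."""
    return re.sub(r"[,;/|]", "_", name)


def sanitize_description(desc: str) -> str:
    """Rule 3: replace | in a gene description with ;."""
    return desc.replace("|", ";")


print(sanitize_sequence("acgtRYn-u"))             # ACGTNNNNN
print(sanitize_name("thrA;b0002/x|y,z"))          # thrA_b0002_x_y_z
print(sanitize_description("kinase|homoserine"))  # kinase;homoserine
```

Note that under rule 1 ambiguity codes such as R or Y, gap characters, and U are all collapsed to N, so this information is not preserved from the original reference file.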

Types

breseq handles repeat sequences specially when predicting transposon insertions. These repeats must have a type of repeat_region or mobile_element.

CDS, tRNA