All Possible Patterns for an RNA String with Example Grammars

James F. Lynn RNAparse/www.RNAparse.com 2012


*The reader should be familiar with basic RNA secondary structure determination and rudimentary formal grammar theory.

Some elementary formal grammars in BNF-like style are used to show how a string with self-referential properties can be
described with an AFFIX operator that in essence attaches one part of the string to another. In the text, least-energy
methods are for the most part ignored for the sake of clarity and O(n) linearity; Thus, structures and structural components      
accepted (matched) by these methods are “possible” structures and structural components only.    

Linear – Regular Expressions (RE)– Type 3 Grammars
Not much needs to be said about regular expression other than the fact they they are a powerful adjunct to context free
and context sensitive grammars.  
Written as a production an RE may look like this:

S::= 'ATGC' //match ATGC
S::='[AT]TGC' //match ATGC,TTGC
Nested – Context-Free Grammars (CFG) – Type 2 Grammars/Type 2 Grammars with Affix Operator
An RNA stem loop (((......)))

S::= a

a::= 'A' [b] 'T' | 'T' [b] 'A' | 'G' [b] 'C | 'C' [b] 'G' // add as many productions as needed
b::= 'A' [c] 'T' | 'T' [c] 'A' | 'G' [c] 'C | 'C' [c] 'G'
c::= loop
loop::= 'ATGC{6}'
A more efficient way of describing stem-loop structures is by means of an affix operator, rho .

S::= a
a::= $x(“”) // lambda      
(
('T' rho(x<="A" x))| // 'T' followed by x, where 'A' is affixed to x i.e. base pairing.
('A' rho(x<="T" x))|
('G' rho(x<="C" x))|
('C' rho(x<="G" x))
)<3>// number of iterations to take      
loop
x
loop::= 'ATGC{6}' //any RE here
Crossing Structures – Context Sensitive Grammars (CSG) – Type 1 Grammars

H-type Pseudoknot ((( [[[ ))) ]]]

S::= a
a::= $x(“”)
(
('T' rho(x<="A" x))|
('A' rho(x<="T" x))|
('G' rho(x<="C" x))|
('C' rho(x<="G" x))
)<3>

$y(“”)
(
('T' rho(y<="A" y))|
('A' rho(y<="T" y))|
('G' rho(y<="C" y))|
('C' rho(y<="G" y))
)<3>

x y

This method also works for the following: simple compliment repeats and tandem compliment repeats:

Reverse Compliment Repeat vs Simple Repeat:
Just change the grammar rules to fit what's needed.


Attenuation Sequences
are systems of stem and loop sequences that can fold in different ways depending on the presence of certain molecules. They may contain 3 or 4 areas that contain compliment bases:

Segment 1 can form a stem with segment 2. Segment 2 can form a stem with segment 3. Segment 3 and 4 can also form a stem.

S::=       //This can be done in a single production, S.
$left_fragment(
$x("")
(
(  
'A'       rho   (x<="T" x))|
(  
'T'       rho   (x<="A" x))|
(  
'G'       rho   (x<="C" x))|
(  
'C'       rho   (x<="G" x))
)<3> //stem size
)
x
left_fragment
x

// matches ATGCATATGCAT, AAGCTTAAGCTT, AAATTTAAATTT...

Kissing Hairpins, Triple Helical Elements  

Kissing hairpins are pseudoknots with added complexity:

Either of these are also easily handled by our system.
An exact grammar for the first figure above in a single production, kiss, in the form:

nn-a-loop1-b-loop2-a'-loop3-c-loop4-b'-loop5-c'-nnn

kiss  
::=

'[AGCT]{2}' //any 2 nts.
$a(“”)  

(  
('T' rho(a<="A" a))| //left side of stem a  
('A' rho(a<="T" a))|  
('G' rho(a<="C" a))|  
('C' rho(a<="G" a))  
)<5>
'[AGCT]{7}' //loop 1      
$b(“”)
(
('T' rho(b<="A" b))| //left side of stem b
('A' rho(b<="T" b))|
('G' rho(b<="C" b))|
('C' rho(b<="G" b))
)<3>
'[AGCT]{1}' //loop 2
a //right side, stem a'
'[AGCT]{3}' //loop 3      
$c(“”)
(
('T' rho(c<="A" c))| //left side of stem c
('A' rho(c<="T" c))|
('G' rho(c<="C" c))|
('C' rho(c<="G" c))
)<4>
'[AGCT]{2}' //loop 4
b //right side, b'
'[AGCT]{8}' //loop 5
c //right side, stem c'
'[AGCT]{3}' //any 3 nts.

//end


On close inspection the second figure above contains 3 tandem compliment base pairs, each one with crossing properties (e.g. AGT-TCA) but can be handled similarly (grammar not shown.)

Discussion
RNA, in these cases is considered an abstract string with a 4-letter alphabet that is unable to form true knots. Here, only a few simple bonding rules are applied – this is not always the case in real RNA where wobble pairs and other types of unusual base pairing may exist. None the less, we show that no matter how complex a pattern is, all secondary patterns that RNA can form are reduced to combinations of the two configurations; nested and crossing. The actual grammars used to parse real RNA secondary structures can be quite complex and nuanced, containing rules that allow for least-energy configurations such that computation structures more closely match natural structures. These grammatical methods could, in theory, also describe strings of any size alphabet that fold in the same basic ways such as polypeptides- more rules are simply added to the existing framework.

An Example of a Grammar-only “Least Energy” Filter for Stem-Loop Structures.  

Consider ((((::::)))). Filling in the parenthesis with nucleotides that fit we can begin with
AAAAxxxxTTTT, GGGGxxxxCCCC, AGAAxxxxTTCT and so on such that any of these are accepted by the above example grammar. In fact, only a few of these configurations have enough strength through weak hydrogen bonding to exist as a natural structure such as GGGGxxxxCCCC. Thus, some set of filters must be set to weed out weaker configurations. One way is to check the final match against a rule set that's known to work. For a stem of length 4 there are n^r or 256 possible combinations, 81 such rules that work leaving 256 – 81 = 175 that do not work:

Adding the hypothetical filter rule “F” to the CS stem-loop grammar:

S::= F a //match F and a rules
a::= 'A' b 'T' | 'T' b 'A' | 'G' b 'C' |'C' b 'G' //GU pairs written in if desired.
b::= 'A' c 'T' | 'T' c 'A' | 'G' c 'C' |'C' c 'G'
c::= 'A' d 'T' | 'T' d 'A' | 'G' d 'C' |'C' d 'G'
d::= 'A' L 'T' | 'T' L 'A' | 'G' L 'C' |'C' L 'G'
L::= '[ATCG]{4}'
F  
::=       ˆ   ( //do not match any of the following weak stem componets:
'AAAA'   |   'AAAT'   |   'AAAG'   |   'AAAC'   |   'AATA'   |   'AATT'   |   'AATG'   |   'AATC'   |   'AAGA'   |
'AAGT'   |   'AAGG'   |   'AAGC'   |   'AACA'   |   'AACT'   |   'AACG'   |   'AACC'   |   'ATAA'   |   'ATAT'   |
'ATAG'   |   'ATAC'   |   'ATTA'   |   'ATTT'   |   'ATTG'   |   'ATTC'   |   'ATGA'   |   'ATGT'   |   'ATGG'   |
'ATGC'   |   'ATCA'   |   'ATCT'   |   'ATCG'   |   'ATCC'   |   'AGAA'   |   'AGAT'   |   'AGAG'   |   'AGAC'   |
'AGTA'   |   'AGTT'   |   'AGTG'   |   'AGTC'   |   'AGGA'   |   'AGGT'   |   'AGCA'   |   'AGCT'   |   'ACAA'   |
'ACAT'   |   'ACAG'   |   'ACAC'   |   'ACTA'   |   'ACTT'   |   'ACTG'   |   'ACTC'   |   'ACGA'   |   'ACGT'   |
'ACCA'   |   'ACCT'   |   'TAAA'   |   'TAAT'   |   'TAAG'   |   'TAAC'   |   'TATA'   |   'TATT'   |   'TATG'   |
'TATC'   |   'TAGA'   |   'TAGT'   |   'TAGG'   |   'TAGC'   |   'TACA'   |   'TACT'   |   'TACG'   |   'TACC'   |
'TTAA'   |   'TTAT'   |   'TTAG'   |   'TTAC'   |   'TTTA'   |   'TTTT'   |   'TTTG'   |   'TTTC'   |   'TTGA'   |
'TTGT'   |   'TTGG'   |   'TTGC'   |   'TTCA'   |   'TTCT'   |   'TTCG'   |   'TTCC'   |   'TGAA'   |   'TGAT'   |
'TGAG'   |   'TGAC'   |   'TGTA'   |   'TGTT'   |   'TGTG'   |   'TGTC'   |   'TGGA'   |   'TGGT'   |   'TGCA'   |
'TGCT'   |   'TCAA'   |   'TCAT'   |   'TCAG'   |   'TCAC'   |   'TCTA'   |   'TCTT'   |   'TCTG'   |   'TCTC'   |
'TCGA'   |   'TCGT'   |   'TCCA'   |   'TCCT'   |   'GAAA'   |   'GAAT'   |   'GAAG'   |   'GAAC'   |   'GATA'   |
'GATT'   |   'GATG'   |   'GATC'   |   'GAGA'   |   'GAGT'   |   'GACA'   |   'GACT'   |   'GTAA'   |   'GTAT'   |
'GTAG'   |   'GTAC'   |   'GTTA'   |   'GTTT'   |   'GTTG'   |   'GTTC'   |   'GTGA'   |   'GTGT'   |   'GTCA'   |
'GTCT'   |   'GGAA'   |   'GGAT'   |   'GGTA'   |   'GCAA'   |   'GCAT'   |   'GCTA'   |   'GCTT'   |   'CAAA'   |
'CAAT'   |   'CAAG'   |   'CAAC'   |   'CATA'   |   'CATT'   |   'CATG'   |   'CATC'   |   'CAGA'   |   'CAGT'   |
'CACA'   |   'CACT'   |   'CTAA'   |   'CTAT'   |   'CTAG'   |   'CTAC'   |   'CTTA'   |   'CTTT'   |   'CTTG'   |
'CTTC'   |   'CTGA'   |   'CTGT'   |   'CTCA'   |   'CTCT'   |   'CGAA'   |   'CGAT'   |   'CGTA'   |   'CGTT'   |
'CCAA'   |   'CCAT'   |   'CCTA'   |   'CCTT'  
)

Towards Protein Secondary Structure using Context-Sensitive Grammars

James F. Lynn RNAparse/www.RNAparse.com March 2012

While amino acid interactions (proteins) are more complex than nucleotide (RNA) interactions, the underlying principles can still be based on similar context-free grammatical production predicates. The secondary shapes of proteins topologically equivalent to the stem-loop and pseudoknot structures found in RNA.


The figure above represents a toy protein where the helix is held into its shape at 12 points, lettered from a to l.

Projected to a flat topology map (below) the crossing properties of each connection becomes more clear: a and c connect across b and so forth. The second set of letters represents how the points of connection are rewritten to represent a point and its compliment (compliment denoted by the prime strike.) Thus: a connects to a' across b, c to c' across b' and d...

If we continue adding amino acids to our toy protein to the point in which the chain may fold back upon the helix and interact with it the topology becomes more complex with an increase of crossing interactions. None the less, the topology is described in the same way.

or [a b a' c b' d e c' f g e' g f' h g' h' g' d']

We can begin to see a few very long distant amino acid interactions that cross other interactions.

To grammatically describe such a shape as above turns out to be remarkably easy if we rewrite each connection point as having a compliment point; a-a', b-b',c-c'...

where points a = a,

b = b,
c = a',
d = c',
e = b', …

The Algorithm for defining the protein shape describes just two things, a point and its compliment and the placement of each point and its compliment:


To describe the second, more complex helix:


let a be the compliment of a',
let b be the compliment of b',
let c be the compliment of c',
let d be the compliment of d',
let e be the compliment of e',
let f be the compliment of f',
let g be the compliment of g',
let h be the compliment of h'

such that a b a 'c b' d e c' f g e' g f' h g' h' g' d'

A very simple grammar for the second helix is expressed in a single predicate: where “rho <=” represents an AFFIX operator that builds a complement from its corresponding point.

my_predicate::=


$A("")((
'a'   rho (A<="c" A))) //accept the empty string. Make “a” the compliment of “c”

//if we wanted to add amino acids to skip, its done here as a regular expression.

$B("")((
'b'   rho (B<="e" B)))

A //”A” becomes the compliment of “a”, “c”

$C("")((
'd'   rho (C<="h" C)))

B

$D("")((
'f'   rho (D<="r" D)))

$E("")((
'g'   rho (E<="k" E)))

C

$F("")((
'i'   rho (F<="m" F)))

$G("")((
'j'   rho (G<="q" G)))

E

$H("")((
'l'   rho (H<="o" H)))

F

$I("")((
'n'   rho (I<="p" I)))

H
I
G
D

//end


Contact   jlynn@acsalaska.net
If you wish further information, source code, or built .exe's, contact me.
Copywrite 2012 all rights reserved save by permission by James F. Lynn
aaaaaaaaaaaaiii