Hi, I'm TKM.
Today, I will introduce some algorithms in Chapter 1 of "Bioinformatics Algorithms".
FREQUENTWORDS
For a 4-mer, each of the four positions can be A, C, G, or T, so the number of possible patterns is 4 x 4 x 4 x 4 = 256.
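To make that count concrete, here is a minimal Python sketch that enumerates every possible k-mer over the DNA alphabet. The helper name all_kmers is my own, not something from the book.

```python
from itertools import product

def all_kmers(k, alphabet="ACGT"):
    """Return every possible k-mer over the alphabet, in lexicographic order."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]

kmers_4 = all_kmers(4)
print(len(kmers_4))   # 256
print(kmers_4[:3])    # ['AAAA', 'AAAC', 'AAAG']
```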
1) If the sequence is "ACGACGACA", sliding a window of length 4 along it yields only 6 4-mers:
ACGA, CGAC, GACG, ACGA, CGAC, GACA
The most frequent 4-mers are then 'ACGA' and 'CGAC', each appearing twice.
2) If the sequence is "ACGAC", sliding a window of length 2 along it yields 4 2-mers:
AC, CG, GA, AC
The most frequent 2-mer is then 'AC', which appears twice.
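Both examples can be reproduced with a short Python sketch. The function names kmers and most_frequent are illustrative ones I picked, and the code assumes the text is at least k letters long. It slides a window of length k over the text and counts the windows with collections.Counter.

```python
from collections import Counter

def kmers(text, k):
    """List every k-mer window of text, left to right."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def most_frequent(text, k):
    """Return (sorted list of the most frequent k-mers, their count)."""
    counts = Counter(kmers(text, k))
    best = max(counts.values())
    return sorted(p for p, c in counts.items() if c == best), best

print(kmers("ACGACGACA", 4))          # ['ACGA', 'CGAC', 'GACG', 'ACGA', 'CGAC', 'GACA']
print(most_frequent("ACGACGACA", 4))  # (['ACGA', 'CGAC'], 2)
print(most_frequent("ACGAC", 2))      # (['AC'], 2)
```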
An algorithm can do this for us by sorting the k-mers and counting the repeats.
First, let's build the INDEX and COUNT arrays for example 2).
for i <- 0 to 3
    Pattern <- Text(i, 2)
    INDEX(i) <- PATTERNTONUMBER(Pattern)
    COUNT(i) <- 1
# Text : ACGAC, Pattern : AC, CG, GA, AC
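The function 'PATTERNTONUMBER(Pattern)' turns each pattern into a number by reading it as a base-4 numeral with A = 0, C = 1, G = 2, T = 3, so AC -> 1, CG -> 6, GA -> 8, AC -> 1.
For example 2), this gives INDEX = (1, 6, 8, 1) and COUNT = (1, 1, 1, 1).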
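Here is a minimal Python sketch of this step. It assumes the base-4 encoding with A = 0, C = 1, G = 2, T = 3; the name SYMBOL_TO_DIGIT and the lowercase function name are my own choices.

```python
SYMBOL_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}

def pattern_to_number(pattern):
    """Read the pattern as a base-4 number (A=0, C=1, G=2, T=3)."""
    number = 0
    for symbol in pattern:
        number = 4 * number + SYMBOL_TO_DIGIT[symbol]
    return number

text, k = "ACGAC", 2
# One INDEX entry per window, and every COUNT entry starts at 1.
index = [pattern_to_number(text[i:i + k]) for i in range(len(text) - k + 1)]
count = [1] * len(index)
print(index)  # [1, 6, 8, 1]
print(count)  # [1, 1, 1, 1]
```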
The next step is to sort INDEX and add up COUNT over runs of equal values, which for example 2) gives SORTEDINDEX = (1, 1, 6, 8) and COUNT = (1, 2, 1, 1).
In this way, we can return the pattern whose INDEX value has the highest COUNT.
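Putting the steps together, the whole sort-and-count procedure could look like the following Python sketch. It is my own reading of the approach rather than the book's exact code: the helper number_to_pattern (the inverse of PATTERNTONUMBER) and all lowercase names are assumptions, and the text is assumed to be at least k letters long.

```python
DIGITS = "ACGT"  # A=0, C=1, G=2, T=3

def pattern_to_number(pattern):
    """Read the pattern as a base-4 number."""
    number = 0
    for symbol in pattern:
        number = 4 * number + DIGITS.index(symbol)
    return number

def number_to_pattern(number, k):
    """Inverse of pattern_to_number for patterns of length k."""
    symbols = []
    for _ in range(k):
        number, digit = divmod(number, 4)
        symbols.append(DIGITS[digit])
    return "".join(reversed(symbols))

def frequent_words_by_sorting(text, k):
    """Most frequent k-mers, found by sorting the numeric index."""
    index = sorted(pattern_to_number(text[i:i + k])
                   for i in range(len(text) - k + 1))
    count = [1] * len(index)
    # Equal values are adjacent after sorting, so a run's last entry holds its total.
    for i in range(1, len(index)):
        if index[i] == index[i - 1]:
            count[i] = count[i - 1] + 1
    max_count = max(count)
    return sorted(number_to_pattern(index[i], k)
                  for i in range(len(index)) if count[i] == max_count)

print(frequent_words_by_sorting("ACGAC", 2))      # ['AC']
print(frequent_words_by_sorting("ACGACGACA", 4))  # ['ACGA', 'CGAC']
```

Sorting costs more than a single counting pass, but it avoids allocating an array with one slot for each of the 4^k possible patterns.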
However, as I have mentioned, some sequences can include mismatches for several reasons.
Therefore, it is essential to develop algorithms that can accommodate mismatches in the DNA sequence.
Let's suppose we have to find the 2-mer pattern 'AG' in ACGAGGAGA, allowing one mismatch.
To find the answer more quickly, I used ChatGPT.
In the sequence 'ACGAGGAGA', five 2-mers are within one mismatch of 'AG' (counting positions from 1): position 1 'AC', position 2 'CG', position 4 'AG', position 5 'GG', and position 7 'AG'.
How about allowing two mismatches for the 3-mer pattern 'ACG'?
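Both questions can be checked with a small Python sketch that compares the pattern against every window using the Hamming distance. The function names are my own, and I am assuming the 3-mer question refers to the same sequence, 'ACGAGGAGA'. Positions in the output are counted from 0, so add 1 to get positions counted from 1.

```python
def hamming_distance(p, q):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(p, q))

def approximate_matches(pattern, text, d):
    """0-based start positions where text matches pattern with at most d mismatches."""
    k = len(pattern)
    return [i for i in range(len(text) - k + 1)
            if hamming_distance(pattern, text[i:i + k]) <= d]

text = "ACGAGGAGA"
print(approximate_matches("AG", text, 1))   # [0, 1, 3, 4, 6]
print(approximate_matches("ACG", text, 2))  # [0, 2, 3, 5, 6]
```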
Analyses like these let us find hidden messages in DNA.
Unlike these toy sequences, though, real data usually means finding patterns in very long strings of letters.
Thank you for reading my post!
I will be back with Chapter 2 of the book next time :)