Candidate: You must prepare your solution to this programming exercise in advance. The designated platform, on which your solution must be demonstrated and on which it will be marked, is the `gcc' compiler running on `Linux'. If you develop a solution on another platform, it is your responsibility to ensure that it works correctly on the designated platform. Read the information under the [prac' guide], [on C and code modularity] and [plagiarism] links on the subject home page.
Groups: Note that different prac'-groups might be set different tasks. Make sure that you do your task. You will get zero marks if you solve the wrong problem.
The solution programs for this prac' must read from standard-input and write to standard-output. Such programs can also be used with files by using I/O `redirection' as follows
prog.out < inp -- read from file `inp' prog.out < inp > op -- read from inp and write to file `op' prog.out > op -- write to file op- see the Unix/ Linux pages.
You can assume that the input data can be read and stored in memory.
You can use the C library quick-sort routine, or
code your own sort routine, as you wish
(heap sort, merge sort, or quick sort).
The task is to write a C program to find the longest reverse-complementary match in DNA, e.g. in the human globin region [HUMHBB (click)]; the DNA sequence is at the end of the file and starts gaattctaatctccctctcaaccctacagtcacccatttggtatattaaagatgtgttgt...
DNA is double stranded and the four DNA bases {A,C,G,T} are complementary in pairs: A with T, and C with G (case does not matter!). One strand is the complement of the other strand (so the other strand of HUMHBB must be cttaag...) Ignore any character other than {a,A,c,C,g,G,t,T}.
For various reasons,
biologists are interested in any long reversed
match between the two strands
of DNA; this appears as a reversed complementary match within
one strand:
Note that DNA sequences are numbered from
A suffix, s[i..n], of s[1..n] can be represented either by `i' or by a pointer to s[i]. A reversed, complemented prefix, reverse(complement(s[1..j])), of s[1..n] can be represented by `j' or by a pointer to s[j], provided that it is understood that complement(s[j]) is its first base, and complement(s[1]) is its last base. Do not make copies of all these substrings -- they will not fit in memory, but their start locations easily will!
Your program must operate by first sorting
the suffixes and sorting the reverse complementary prefixes
1234567890 ACGGAACCGA -- S, i.e. the input TGCCTTGGCT -- complement S TCGGTTCCGT -- reverse complement S
sorted | ||
---|---|---|
S | rev' comp' S | |
A AACCGA ACCGA ACGGAACCGA CCGA CGA CGGAACCGA <-- S[2..4] GA GAACCGA GGAACCGAi.e. The suffixes of s, in alphabetic order, from s[10..10]=A to s[3..10]=GAACCGA | CCGT CGGTTCCGT <-- S[7..9] CGT GGTTCCGT GT GTTCCGT T TCCGT TCGGTTCCGT TTCCGTi.e. The suffixes of rev'(comp'(s)), in alphabetic order, from CCGT to TTCCGT |
Print out the start and end locations of the longest matching substrings, and the substring (forward version) itself. (Do not print all the suffixes. Do not make copies of all the suffixes.)
Do not use a suffix tree -- it is too complex for this exercise. If the DNA is "fairly random", most string comparisons will quickly end with a mismatch and the algorithm's running time will be close to O(n.log(n)). We now have some very long DNA sequence, e.g. for Plasmodium falciparum which causes [malaria (click)], a disease that kills 1,500,000 to 2,700,000 people each year [-W.H.O. 1997].
Your task
is to write a C program to
implement a simple(!) text compression method
loosely based on the Lempel-Ziv idea that many files
contain repeated substrings. On the second (etc.) repetition of a substring
a "pointer" can be used to indicate the first occurrence.
e.g.
cbadefgpqrstuvhijklpqrstuvmno
can be coded as
cbadefgpqrstuvhijkl<12,7>mno
,
where <12,7> means, count back 12 characters and copy 7 characters.
This is better because <12,7> is six characters standing
for (i.e. coding) pqrstuv's seven.
Note that aaaaaaaaaa
can be coded as
a<1,9>
.
Remember that spaces and newlines are just characters.
You can assume that `<' does not appear in the input text.
The method relies upon finding long repeated substrings in the input text. You must do this by sorting all the suffixes of input text, i.e. text[0..n-1], text[1..n-1], text[2..n-1], ..., text[n-1..n-1].
adefgpqrstuvhijklpqrstuvmno badefgpqrstuvhijklpqrstuvmno ... no o pqrstuvhijklpqrstuvmno <-- long pqrstuvmno <-- match qrstuvhijklpqrstuvmno qrstuvmno ... vhijklpqrstuvmno vmno -- Above: All suffixes of the input in alphabetical order -- -- when e.g. text[] = cbadefgpqrstuvhijklpqrstuvmno --
A suffix can be represented by its start location (as either a char-pointer or an integer). Beware: Do not make copies of the suffixes! If n=106, text[0..n-1] will easily fit in memory, but the total length of the substrings is 0.5×1012, which is enormous! (106 pointers will easily fit.)
e.g. part way through run
output: cbadefgpqrstuvhijkl considering: pqrstuvmno ^ | i ?match? |
Your algorithm must work along the input text from left to right.
Suppose that it has coded text[0..i-1] and is at position `i'
considering text[i..n-1].
Using the sorted suffixes,
find the longest match to
text[i..i+length-1] that starts at position `j'
in text[0..i-1], 0<=j<i; resolve ties in favour of a later substring.
Assume that this is the best candidate for a repeat.
Code text[i..i+length-1] as <i-j,length>
and move on to position i+length, if this results in a saving.
Otherwise emit text[i] and move on to the next position, i+1.
Joel R' has pointed out that the
binary search can in fact be avoided if some extra information is
maintained to indicate where each suffix appears in the sorted collection
-- up to you, your choice. |
You should use binary search to find text[i..n] amongst the collection of sorted suffixes. Any long match to text[i..i+length-1] will occur near text[i..n] in the collection.
If the text is "fairly random", most string comparisons
will quickly end with a mismatch and the algorithm's running time will be
close to O(n.log(n)).
Assessment (both) | |
---|---|
The grades below assume that the candidate understands and can explain the program, i.e. the candidate wrote it. If not... | 0/6 |
Program compiles, runs, inputs a variety of short data, and outputs good but not necessarily optimal solutions. | 3/6 |
Program quickly produces optimal answer on a wide
variety of short test inputs (hundreds of characters). (NB. Maybe 4/6 if "quite slowly" and "very short".) | 5/6 |
Program produces optimal answer on long inputs (10,000 to 100,000+ characters) reasonably quickly. It has been thoroughly tested, and is well written (not necessarily in theStyle) and commented with a sound argument towards its correctness | 6/6 |