Prac 3, CSE2304, CSSE, Monash University

Prac 3, CSE2304, CSSE, Monash, Semester 1, 2002

Group A: week 8, 29 April - 3 May,
Group B: week 9, 6 - 10 May

Candidate: You must prepare your solution to this programming exercise in advance. The designated platform, on which your solution must be demonstrated and on which it will be marked, is the `gcc' compiler running on `Linux'. If you develop a solution on another platform, it is your responsibility to ensure that it works correctly on the designated platform. Read the information under the [prac' guide], [on C and code modularity] and [plagiarism] links on the subject home page.

Groups: Note that different prac'-groups might be set different tasks. Make sure that you do your task. You will get zero marks if you solve the wrong problem.

The solution programs for this prac' must read from standard-input and write to standard-output. Such programs can also be used with files by using I/O `redirection' as follows

   prog.out < inp         -- read from file `inp'
   prog.out < inp > op    -- read from inp and write to file `op'
   prog.out > op          -- write to file op

- see the Unix/ Linux pages.

You can assume that the input data can be read and stored in memory.
You can use the C library quick-sort routine, or code your own sort routine, as you wish (heap sort, merge sort, or quick sort).

Group A

The task is to write a C program to find the longest reverse-complementary match in DNA, e.g. in the human globin region [HUMHBB (click)]; the DNA sequence is at the end of the file and starts gaattctaatctccctctcaaccctacagtcacccatttggtatattaaagatgtgttgt...

DNA is double stranded and the four DNA bases {A,C,G,T} are complementary in pairs: A with T, and C with G (case does not matter!). One strand is the complement of the other strand (so the other strand of HUMHBB must be cttaag...) Ignore any character other than {a,A,c,C,g,G,t,T}.

For various reasons, biologists are interested in any long reversed match between the two strands of DNA; this appears as a reversed complementary match within one strand: i.e. s[i..i+length-1]=reverse(complement(s[j..j+length-1])). (`i' and `j' can be equal or unequal.)
Note that DNA sequences are numbered from position `1' up!

A suffix, s[i..n], of s[1..n] can be represented either by `i' or by a pointer to s[i]. A reversed, complemented prefix, reverse(complement(s[1..j])), of s[1..n] can be represented by `j' or by a pointer to s[j], provided that it is understood that complement(s[j]) is its first base, and complement(s[1]) is its last base. Do not make copies of all these substrings -- they will not fit in memory, but their start locations easily will!

Your program must operate by first sorting the suffixes and sorting the reverse complementary prefixes of S. It must then scan the sorted collections in ascending order searching for the longest match between a suffix and a reversed complemented prefix.

1234567890
ACGGAACCGA  -- S, i.e. the input
TGCCTTGGCT  -- complement S
TCGGTTCCGT  -- reverse complement S

sorted
S		rev' comp' S
A AACCGA ACCGA ACGGAACCGA CCGA CGA CGGAACCGA <-- S[2..4] GA GAACCGA GGAACCGA i.e. The suffixes of s, in alphabetic order, from s[10..10]=A to s[3..10]=GAACCGA		CCGT CGGTTCCGT <-- S[7..9] CGT GGTTCCGT GT GTTCCGT T TCCGT TCGGTTCCGT TTCCGT i.e. The suffixes of rev'(comp'(s)), in alphabetic order, from CCGT to TTCCGT

Print out the start and end locations of the longest matching substrings, and the substring (forward version) itself. (Do not print all the suffixes. Do not make copies of all the suffixes.)

Do not use a suffix tree -- it is too complex for this exercise. If the DNA is "fairly random", most string comparisons will quickly end with a mismatch and the algorithm's running time will be close to O(n.log(n)). We now have some very long DNA sequence, e.g. for Plasmodium falciparum which causes [malaria (click)], a disease that kills 1,500,000 to 2,700,000 people each year [-W.H.O. 1997].

Group B

Your task is to write a C program to implement a simple(!) text compression method loosely based on the Lempel-Ziv idea that many files contain repeated substrings. On the second (etc.) repetition of a substring a "pointer" can be used to indicate the first occurrence.
e.g. cbadefgpqrstuvhijklpqrstuvmno can be coded as cbadefgpqrstuvhijkl<12,7>mno, where <12,7> means, count back 12 characters and copy 7 characters. This is better because <12,7> is six characters standing for (i.e. coding) pqrstuv's seven.
Note that aaaaaaaaaa can be coded as a<1,9>. Remember that spaces and newlines are just characters. You can assume that `<' does not appear in the input text. [Question: How would you deal with it if it did?]

The method relies upon finding long repeated substrings in the input text. You must do this by sorting all the suffixes of input text, i.e. text[0..n-1], text[1..n-1], text[2..n-1], ..., text[n-1..n-1].

   adefgpqrstuvhijklpqrstuvmno
   badefgpqrstuvhijklpqrstuvmno
   ...
   no
   o
   pqrstuvhijklpqrstuvmno    <-- long
   pqrstuvmno                <-- match
   qrstuvhijklpqrstuvmno
   qrstuvmno
   ...
   vhijklpqrstuvmno
   vmno

   -- Above: All suffixes of the input in alphabetical order --
   -- when e.g. text[] = cbadefgpqrstuvhijklpqrstuvmno       --

A suffix can be represented by its start location (as either a char-pointer or an integer). Beware: Do not make copies of the suffixes! If n=10⁶, text[0..n-1] will easily fit in memory, but the total length of the substrings is 0.5×10¹², which is enormous! (10⁶ pointers will easily fit.)

e.g. part way through run

output: cbadefgpqrstuvhijkl
considering:               pqrstuvmno
                           ^
                           |
                           i ?match?

Your algorithm must work along the input text from left to right. Suppose that it has coded text[0..i-1] and is at position `i' considering text[i..n-1]. Using the sorted suffixes, find the longest match to text[i..i+length-1] that starts at position `j' in text[0..i-1], 0<=j<i; resolve ties in favour of a later substring. Assume that this is the best candidate for a repeat. Code text[i..i+length-1] as <i-j,length> and move on to position i+length, if this results in a saving. Otherwise emit text[i] and move on to the next position, i+1.

Joel R' has pointed out that the binary search can in fact be avoided if some extra information is maintained to indicate where each suffix appears in the sorted collection -- up to you, your choice.

You should use binary search to find text[i..n] amongst the collection of sorted suffixes. Any long match to text[i..i+length-1] will occur near text[i..n] in the collection.

If the text is "fairly random", most string comparisons will quickly end with a mismatch and the algorithm's running time will be close to O(n.log(n)). Do not use a suffix tree - which would guarantee a faster algorithm, but is too complex for this exercise.

Assessment (both)
The grades below assume that the candidate understands and can explain the program, i.e. the candidate wrote it. If not...	0/6
Program compiles, runs, inputs a variety of short data, and outputs good but not necessarily optimal solutions.	3/6
Program quickly produces optimal answer on a wide variety of short test inputs (hundreds of characters). (NB. Maybe 4/6 if "quite slowly" and "very short".)	5/6
Program produces optimal answer on long inputs (10,000 to 100,000+ characters) reasonably quickly. It has been thoroughly tested, and is well written (not necessarily in theStyle) and commented with a sound argument towards its correctness	6/6

Created with "vi (Linux & IRIX)", charset=iso-8859-1

Prac 3, CSE2304, CSSE, Monash, Semester 1, 2002

Group A: week 8, 29 April - 3 May, Group B: week 9, 6 - 10 May

Group A

Group B

Group A: week 8, 29 April - 3 May,
Group B: week 9, 6 - 10 May