Project Synopsis
Project Requirements as stated in class.
Description (Background and goals): The Human Genome Project is an international project led by the Sanger Center in the UK and the National Institutes of Health in the US. Its goal is to map out the human genome completely by DNA sequencing by the year 2005. Because the human mitochondrial genome (16.6 kb) has already been completely sequenced, the genome under investigation is the nuclear genome. The entire genome is about 3 billion base pairs long; each chromosome is on the order of hundreds of millions of base pairs in size. A base is one of the 4 nucleotides: adenine (A), cytosine (C), guanine (G), or thymine (T). In the helical structure of the DNA molecule, these bases come in pairs: adenine pairs with thymine, and cytosine pairs with guanine. This DNA sequence of A, T, G, and C (3 billion of them) can be translated into proteins. Each gene codes for one protein. There are about 65,000-80,000 genes. The genome is divided into 24 different chromosomes, numbered 1-22 plus the X and Y sex chromosomes.
Dr. Kim's lab is currently working on sequencing human chromosome 16, which is about 100 million base pairs long. The difficulty in sequencing comes in part from the fact that, to read the sequence, you need not one copy of it (which could be found in almost any human cell) but millions of copies (given current sequencing technology). The only way to get millions of copies is to cut the 100 million base pair long chromosome into thousands of tiny pieces using partial restriction digestion, ligate the pieces into vectors, transform the vectors into E. coli bacteria, allow the positively transformed bacteria to multiply into clones (they copy the foreign human DNA along with their own when they reproduce), and then extract the freshly replicated strands of human DNA for analysis. The problem is that no one is really sure how all these thousands of tiny segments of DNA fit back together to form the original chromosome.
To make matters more difficult, the restriction enzymes that are used as chemical "scissors" to chop the DNA into pieces do not cut at random intervals. If they did, it would be relatively easy, with a large enough sample size, to be assured of continuously overlapping segments (contigs) for the entire length of the chromosome. Instead they cut only at certain special sites, which leads to a non-random distribution of pieces and means that, even with a very large sample size, the researchers can only be sure of covering about 95% of the sequence. Also, the actual sequencing technology (which reads in the A, C, G, and T sequence) is still very primitive and can read at best 500 base pairs at a time. Since most of the strands (called "clones") are thousands to tens of thousands of bases long, many clones have just their ends sequenced and their middles unknown. These clones are called "shotgun" clones (because you get a random pattern of sequenced chunks all across the clone). A few clones are fully sequenced; many have yet to be sequenced at all. For some, nothing is known except the approximate length of the clone, as estimated by agarose gel electrophoresis.
But all is not lost. You can think of the chromosome as a kind of one-dimensional map, and on this map there are some landmarks. Through a process called fluorescence in situ hybridization (FISH), researchers are able to introduce fluorophore markers into some of the clones and allow them to hybridize with chromosomal DNA. It is then relatively easy to look at the chromosome under UV irradiation and see where the fluorescent bits ended up, which gives at least a very general idea of where the clone fits into the chromosome. There are also Bacterial Artificial Chromosomes (BACs), which were invented here at Caltech. These special clones may be as small as about 500 base pairs long, making sequencing them relatively easy.
There are also very specific landmarks scattered (again, non-randomly) throughout the genome. They are called sequence tagged sites (STS sites). Most of these sites are unique sequences a few hundred base pairs long. That is to say, if you find this sequence, you know exactly where you are in the genome. There are about 30,000 of these STS sites.
So it seems that the basic unit of data in this project is the clone. Each clone has its own catalog number and a set of data associated with it: how long it is, how much of it has been sequenced (if any) and that sequence, what STS sites it contains, where it was placed by FISH mapping, and so on. All of this information is contained in c16db, which is a modified version of ACeDB (A C. elegans DataBase, named after the worm that was one of the first lucky organisms to have its genome sequenced; you'll notice file suffixes like .wrm in the source, and files called things like w1, w2, w3, etc., all of which comes from "worm"). This is the database that we are going to be working with. In our "genome" account at date.tree.caltech.edu, if you look at the .ace files in c16db/rawdata/* and c16db/rawdata/c16_ace_files/*, you will see the format in which this information is organized. c16db reads in and parses these .ace files and stores the information in its own binary format. It is also capable of dumping back out to .ace file format.
When the researchers want to enter new data, they currently use a text editor to alter the database files by hand. They would like to be able to do this in a more efficient, less tedious way. Our specific focus right now is to modify or develop a GUI for graphical data entry. c16db can graphically display physical contig maps, but all the data has to be entered by hand, and there is a lot of data. What this means is that every 2 weeks or so, some poor grad student has to sit down and slave away at the keyboard for a day doing tedious data entry to bring the database up to date. What Dr. 
Kim (and others; this kind of database is used all over the world in human and other genome mapping projects) would like to be able to do, as far as the GUI goes, is color code the clones based on their characteristics, create new entries by clicking and dragging to draw the clone, and allow dynamic rearranging of the clones onscreen (with the appropriate updates in the database when something is moved). Most of the information they have on these clones isn't exact, so they would like to be able to enter it into the database in a similarly fuzzy way. They have also been thinking it would be nice if the data could be output directly (graphically) over the web. We don't have to deal with the actual (mathematically messy) problem of trying to fit this huge linear jigsaw puzzle back together. The NIH has many computers with gigabytes of memory dedicated to only that, cranking away day and night.
Glen George has suggested a couple of ways that we might go about doing this project: either rewrite the existing code in C++, adding features and expansion capability as we go along, or build a smaller suite of helper programs that store their own data in their own format, read in the c16db data, and deal with display. But because we are going to have to alter the database dynamically, this second option might turn out to be just as complex as the first. The code that currently exists is written in C and is (apparently) built to deal with Mac and PC platforms too (at least some of the display code is). Dr. Kim claims that the code is already somewhat object oriented, as is the database, so rewriting might not be as crazy as it sounds (there are, according to Matt Doucleff, about 150,000 lines of code already written). Unfortunately, with the vast amount of code we already have, I suspect that the program is doing a lot that we don't really understand yet. If anyone has any insight into what that might be, it would be great to hear from you.
Inputs and Outputs: Much of our work involving inputs and outputs is defining exactly how the user will talk to the system. The User Interface specification covers that in detail. However, there are many programming issues involved in handling the input of data before that input is interpreted, reading data files, updating the screen, and handling the final data output. The input and output specifications detail our approach to these problems. As far as program design goes, most of the functionality described here will be spread throughout the system. The database subsystem will know how to handle its input and output files, and the GUI system will deal with interactive events. It is convenient to consider these topics as a separate system only because they are logically related.
Inputs to the program come from two sources: the User Interface and the genome database files (see Database Requirements). The inputs are categorized into two groups:
Interactive Inputs
Interactive user input will be entirely through the GUI, as described in the User Interface specification. These operations require both keyboard and mouse input, allowing the use of GUI objects such as buttons, menu bars, and scrollbars, as well as hot keys to zoom, scroll, select various display objects, and activate various program functions.
The inputs must be handled using the default X Toolkit input mechanism, which typically involves the registration of function callbacks and ``actions'' to respond to particular types of input events. Each widget defines its own set of automatic responses when it receives an event, eliminating much of our work. Although the Toolkit's widget hierarchy provides many different generic widget types, such as scrollbars and buttons, we may find it useful to define our own widgets in order to get the various behaviors we need.
Outputs from the program go to two destinations: the display and the database files. For more information on these, read the User Interface specification and the Database Requirements spec. The outputs are categorized into two groups:
Screen Display
Much of the work for this project involves creating a useful user interface. We plan on using the X Toolkit exclusively for all of our graphics. This ensures that our program will run on nearly any X workstation, and also greatly reduces the amount of coding we'll need to do. Much of the drawing will be handled by widgets such as buttons and menus, which the X Toolkit takes care of for us. The other graphics, such as the genetic map, will require our own design. Even then we'll rely upon the utilities provided by X to do most of the drawing for us. We may choose to define new X widgets for objects like clones and STS sites. This would allow us to encapsulate both drawing and event handling in the same object in the code. These are most of the details we need to consider for the display. The X event handling mechanism handles ``house-keeping'' events such as window hiding and resizing for us.
File Outputs
Initially, our program reads in two database files as described in the inputs section. The user then modifies this data and adds any new data via the User Interface. The database is in charge of keeping track of the information as it is created or modified. When everything is finished, the data must be written back out to files. These files will be in the same format as those read (ACE database format) and will contain the modified information.
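As a rough illustration of the file output step, the sketch below writes objects back out as text paragraphs. It assumes the commonly seen .ace layout (a class-and-name line, one line per tag, and a blank line between objects); Appendix A remains the authority on the real syntax. The AceObject struct, the writeAce function, and any tag names passed to it are hypothetical and are not taken from the c16db source.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory form of one database object (not from c16db).
struct AceObject {
    std::string className;  // e.g. "Clone"
    std::string name;       // catalog number of the object
    // Each entry is (tag, value text); the value may be empty.
    std::vector<std::pair<std::string, std::string>> tags;
};

// Write every object as one paragraph: a header line, the tag lines,
// then a blank line. This layout is an assumption about .ace files.
void writeAce(std::ostream& out, const std::vector<AceObject>& objects) {
    for (const AceObject& obj : objects) {
        // Programmer errors are caught by assertions, as the spec plans.
        assert(!obj.className.empty() && !obj.name.empty());
        out << obj.className << " : \"" << obj.name << "\"\n";
        for (const auto& tag : obj.tags) {
            out << tag.first;
            if (!tag.second.empty()) out << ' ' << tag.second;
            out << '\n';
        }
        out << '\n';  // blank line ends the object
    }
}
```

Writing to a std::ostream keeps the sketch independent of where the data ends up: the same function can fill a std::ofstream for the database files or a std::ostringstream for testing.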
User Interface
General
The user interface will consist of one X window, which will take both mouse and text input and give the user text and graphical output.
Object Selection
To select an object, the user left clicks on it. To display all relevant information about the object, the user right clicks on it.
Left/Right fields
In the lower right hand corner of the window there will be two small text fields, one displaying the left coordinate of the currently selected object and one displaying its right coordinate. There will be increment and decrement buttons associated with each of the text fields, allowing the user to change the value of the left or right coordinate of the selected object directly. The user can also just delete the current number and enter a new one. For STS markers and other 1-coordinate objects, both left and right will display the same number and be linked (i.e., you can't change the value in one without changing the value in the other as well). The files containing the left and right endpoints of the objects are contained in: file://date.tree.caltech.edu/data0/genome/c16db/rawdata/c16db-all*
Chromosome Display
The top portion of the window will contain a graphical representation of the chromosome being mapped, arranged horizontally (similar to the c16db graphic). This graphic will display the banding of the entire chromosome and the locations of the centromere and telomeres. At the very top of the window will be a ruler giving the coordinates in ACeDB units. The different bands in the chromosome will be labeled.
Zoom/Scroll Bar
Below the graphic of the chromosome there will be a scroll/zoom bar. If you click on the endpoints of the bar, you can stretch it to make the zoom field bigger or smaller. If you click in the middle of the bar, you can drag it to the left and to the right to change the part of the chromosome that you are viewing. 
You can also use the left/right fields in the lower right corner to change the zoom factor and displayed location when the zoom/scroll bar is the selected object.
STS and other marker sites
Below the zoom/scroll bar there will be another display of the chromosome, showing only the section that is currently zoomed and the labels of the bands that show up there. Below the zoomed chromosome graphic, there will be a ruler in ACeDB units, scaled to the zoomed area. On that ruler will be the STS and other marker sites, each marked by a small triangle and the catalog number of that marker. If you select an STS site, you can alter its coordinates just like those of all the other objects.
Clone Display
The remaining portion of the window will be taken up by the actual display of the clones as lines. They too will be displayed horizontally, each with its associated catalog number, but only when you are zoomed in close enough for there to be room to display the catalog number (so you don't get alphabet soup when you're zoomed all the way out). When you select a clone, you can alter its left/right coordinates in the text boxes at lower right. The clones are color coded in 3 distinct colors: one for completely sequenced clones, one for "shotgun" (partially sequenced) clones, and one for unsequenced clones. If the ends of a clone are sequenced, they show up in a "bright" color; otherwise they show up grey. Ideally, we would like to have the tool look at the sequences of the ends of the clones and align them itself, but this is not planned in the current implementation.
Query box
In the lower left hand corner of the display, there is a query box, into which the user can type a catalog number to search for. If the object is found, it is highlighted and the display centers on it (if it is a clone and you weren't already zoomed in far enough, the display zooms in far enough that the catalog number is displayed). 
If the catalog number is not found, a dialog box pops up ("Unable to find XXXXXXXX, please check the number and try again.")
Pop-up menus
If you click the right button, you get a pop-up menu. In it we have:
Printing gives color output (if possible) and prints only the chromosome graphic and clone view sections of the display.
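The zoom/scroll behavior and the rule for showing catalog numbers described in the User Interface section both reduce to converting between map coordinates and screen pixels. The following is a minimal sketch of that conversion, not a design taken from c16db: the Viewport struct and all of its member names are hypothetical, and the rule that a label is shown only when the clone's drawn length is at least as wide as the label is an assumption about what "enough room" means.

```cpp
#include <cassert>

// Hypothetical description of the zoomed region of the chromosome.
struct Viewport {
    double left;    // leftmost visible map coordinate (ACeDB units)
    double right;   // rightmost visible map coordinate
    int widthPx;    // width of the drawing area in pixels

    // Zoom factor: how many pixels one map unit occupies.
    double pixelsPerUnit() const {
        assert(right > left && widthPx > 0);
        return widthPx / (right - left);
    }

    // Map coordinate to horizontal pixel offset inside the drawing area.
    int toPixel(double coord) const {
        return static_cast<int>((coord - left) * pixelsPerUnit());
    }

    // Pixel offset back to a map coordinate, e.g. for mouse clicks.
    double toCoord(int px) const { return left + px / pixelsPerUnit(); }

    // Dragging the middle of the scroll bar shifts both ends equally.
    void scroll(double delta) {
        left += delta;
        right += delta;
    }

    // Assumed rule: draw the catalog number only if the clone's
    // on-screen length is at least as wide as its label.
    bool labelFits(double cloneLeft, double cloneRight, int labelWidthPx) const {
        return (cloneRight - cloneLeft) * pixelsPerUnit() >= labelWidthPx;
    }
};
```

Stretching an endpoint of the zoom bar corresponds to changing only left or only right, which changes pixelsPerUnit and therefore which labels fit.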
Error Handling: Errors may occur at almost all stages of the program. Programmer's errors will be handled by extensive assertions in the code. Below we consider errors at the user interface and file interface and discuss our handling methods.
Database Requirements:
A dynamic list of BAC Clones will be maintained (read from .ace
files). The BAC Clones list will be searchable by position and
name. Each BAC Clone record will store its name, center
position, size, resizability (yes/no), color (for output), the
markers contained in the clone, its in situ position, and ACE
database remark fields. Both the BAC Clone list and the STS
Marker list will be editable (in place). Searches will return a
database object (lists of objects) with reference semantics. The
Database will be stored in ACE files. A description of the ACE
database file format is given in Appendix A (or see:
http://probe.nalusda.gov:8000/acedocs/syntax.html )
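To make the requirements above concrete, here is a minimal sketch of a BAC clone record and of searches that return results with reference semantics. The field list follows the paragraph above, but the names BacClone, findByPosition, and findByName, the field types, and the choice of std::vector as the dynamic list are all assumptions made for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record; field names are illustrative, not from c16db.
struct BacClone {
    std::string name;
    long center = 0;                   // center position, in map units
    long size = 0;                     // length, in the same units
    bool resizable = true;             // re-sizeability (yes/no)
    std::string color;                 // color used for output
    std::vector<std::string> markers;  // markers contained in the clone
    std::string inSituPosition;        // position from in situ mapping
    std::vector<std::string> remarks;  // ACE database remark fields

    long left() const {
        assert(size >= 0);
        return center - size / 2;
    }
    long right() const { return left() + size; }
};

// Search by position: every clone whose extent covers pos. The result
// holds pointers into the list, so edits through them change the
// stored records in place (reference semantics).
std::vector<BacClone*> findByPosition(std::vector<BacClone>& clones, long pos) {
    std::vector<BacClone*> hits;
    for (BacClone& c : clones)
        if (c.left() <= pos && pos <= c.right()) hits.push_back(&c);
    return hits;
}

// Search by name: the first clone with this name, or nullptr if absent.
BacClone* findByName(std::vector<BacClone>& clones, const std::string& name) {
    for (BacClone& c : clones)
        if (c.name == name) return &c;
    return nullptr;
}
```

One caveat of this particular choice: pointers into a std::vector are invalidated when the vector grows, so a real implementation would need either to refresh search results after insertions or to use a container with stable addresses.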
The only algorithm we would implement is that for searching the database to find the information needed or to save the modified information. The clone name is in the form "A-###B##", where "A" and "B" stand for any letter and "###" and "##" are a three-digit and a two-digit number, respectively. We need to keep the database sorted alphabetically by clone name. The sort can be done with the mergesort algorithm; once the names are sorted, a search by clone name can be done quickly with binary search.
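The name format and the sorted lookup can be sketched as follows. This is an illustration under stated assumptions rather than the planned implementation: the function names are hypothetical, std::stable_sort stands in for a hand-written mergesort (it is commonly implemented as one), and std::lower_bound provides the binary search over the sorted names.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// True if s has the form letter, '-', three digits, letter, two digits.
bool isValidCloneName(const std::string& s) {
    if (s.size() != 8) return false;
    auto letter = [](char c) {
        return std::isalpha(static_cast<unsigned char>(c)) != 0;
    };
    auto digit = [](char c) {
        return std::isdigit(static_cast<unsigned char>(c)) != 0;
    };
    return letter(s[0]) && s[1] == '-' &&
           digit(s[2]) && digit(s[3]) && digit(s[4]) &&
           letter(s[5]) && digit(s[6]) && digit(s[7]);
}

// Put the names in alphabetical order.
void sortCloneNames(std::vector<std::string>& names) {
    std::stable_sort(names.begin(), names.end());
}

// Binary search in an already sorted list: index of name, or -1.
long findCloneName(const std::vector<std::string>& sorted,
                   const std::string& name) {
    assert(std::is_sorted(sorted.begin(), sorted.end()));
    auto it = std::lower_bound(sorted.begin(), sorted.end(), name);
    if (it == sorted.end() || *it != name) return -1;
    return static_cast<long>(it - sorted.begin());
}
```

With the list kept sorted, each lookup needs on the order of log2(n) comparisons, so even tens of thousands of clone names take fewer than twenty string comparisons per query.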
There are basically two major objects in our code: clones
(we will focus on BAC clones and ignore YAC clones) and markers
(genetic markers, STS markers, EST markers, etc.), with the goal of
matching BAC clones to markers. They are implemented as classes in C++.
The data members for the BAC clone class would include:
The data members for the marker class would include:
* The DNA sequence data for either clones or markers will simply be in the form of a one-dimensional array. The array size is determined by the sequence length. If the DNA is not sequenced or not completely sequenced, the array size is set to MAX_INT (even though the approximate sequence length can be deduced from gel electrophoresis).
Our program certainly has some limitations, as follows: