Project Synopsis


Project Requirements as stated in class.



Description (Background and goals):

The Human Genome Project is an international project led by the Sanger Center in the UK and the National Institutes of Health in the US. Its goal is to completely map out the human genome by DNA sequencing by the year 2005. As the human mitochondrial genome (16.6 kb) has already been completely sequenced, the genome under investigation is the nuclear genome. The entire genome is about 3 billion base pairs long; each chromosome is on the order of tens to hundreds of millions of base pairs in size. A base is one of the 4 nucleotides: adenine (A), cytosine (C), guanine (G), or thymine (T). In the helical structure of the DNA molecule, these bases come in pairs. Adenine matches with thymine, and cytosine matches with guanine. This DNA sequence of A, T, G, or C (3 billion of them) can be translated into proteins. Each gene codes for one protein. There are about 65,000-80,000 genes. The genome is broken up into 24 different chromosomes, numbered 1-22 plus the X and Y sex chromosomes. Dr. Kim's Lab is currently working on sequencing human Chromosome 16, which is about 100 million base pairs long.
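The pairing rule (A with T, C with G) is simple enough to write down directly. The sketch below is only an illustration in C++; the names complementBase and complementStrand are our own assumptions and are not part of c16db.

```cpp
#include <stdexcept>
#include <string>

// Return the base that pairs with the given one: A<->T, C<->G.
// Hypothetical helper for illustration; not part of c16db.
char complementBase(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  throw std::invalid_argument("not one of A, C, G, T");
    }
}

// Build the partner strand base by base, in the same left-to-right order.
std::string complementStrand(const std::string& strand) {
    std::string partner;
    partner.reserve(strand.size());
    for (char base : strand) {
        partner.push_back(complementBase(base));
    }
    return partner;
}
```

For example, complementStrand("GATTACA") gives "CTAATGT".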

The difficulty in sequencing comes in part from the fact that, in order to look at the sequence, you need not one copy of it (which could be found in almost any human cell) but millions of copies (given current sequencing technology). The only way to get millions of copies is to cut the 100 million base pair long chromosome into thousands of tiny pieces using partial restriction digestions, ligate them into vectors, transform these pieces into E. coli bacteria, allow the positively transformed bacteria to multiply into clones (they copy the foreign human DNA along with their own when they reproduce), and then extract the crop of freshly replicated strands of human DNA for analysis. The problem is, no one is really sure how all these thousands and thousands of tiny segments of DNA fit back together to form the original chromosome. To make matters more difficult, the restriction enzymes that are used as chemical "scissors" to chop the DNA into pieces do not cut at random intervals. If they did, it would be relatively easy, with a large enough sample size, to be assured of continuously overlapping segments (contigs) for the entire length of the chromosome. Instead, they cut only at certain special sites, leading to a non-random distribution of pieces, which means that even with a very large sample size the researchers can only be sure of covering about 95% of the sequence. Also, the actual sequencing technology (where the A, C, G, and T sequence is read in) is still very primitive and can read at best 500 base pairs at a time. Since most of the strands (called "clones") are thousands to tens of thousands of bases long, many clones have just their ends sequenced and their middles unknown. These clones are called "shotgun" clones (because you get a random pattern of sequenced chunks all across the clone). A few clones are fully sequenced; many have yet to be sequenced at all. For some, nothing is known except the approximate length of the clone from agarose gel electrophoresis. But all is not lost.

You can think of the chromosome as a kind of one-dimensional map, and on this map there are some landmarks. Through a process called fluorescence in situ hybridization (also known as FISH), researchers are able to introduce fluorophore markers into some of the clones and allow them to hybridize with chromosomal DNA. It is then relatively easy to look at the chromosome and see where the fluorescent bits ended up under UV irradiation, thus giving at least a very general idea of where the clone fits into the chromosome. There are also Bacterial Artificial Chromosomes (BACs), which were invented here at Caltech. These special clones may be as small as about 500 base pairs long, making sequencing them relatively easy. There are also very specific landmarks scattered (again, non-randomly) throughout the genome. They are called sequence tagged sites (also called STS sites). Most of these sites are unique sequences a few hundred base pairs long. That is to say, if you find this sequence, you know exactly where you are in the genome. There are about 30,000 of these STS sites.

So it seems that the basic unit of data in this project is the clone. Each clone has its own catalog number and has a set of data associated with it: how long it is, how much of it has been sequenced (if any) and that sequence, what STS sites it contains, where it was placed by FISH mapping, etc. All of this information is contained in c16db, which is a modified version of ACeDB (A C. elegans DataBase, named after the worm that was one of the first lucky organisms to have its genome sequenced). You'll notice file suffixes like .wrm in the source, and files called things like w1, w2, w3, etc.; all of this comes from "worm". This is the database that we are going to be working with.

In our "genome" account at date.tree.caltech.edu, the .ace files in c16db/rawdata/* and c16db/rawdata/c16_ace_files/* show the format in which this information is organized. c16db reads in and parses these .ace files and stores the information in its own binary format. It is also capable of dumping back out to .ace file format. When the researchers want to enter new data, they currently use a text editor to alter the database files by hand. They would like to be able to do this in a more efficient, less tedious way.

Our specific focus right now is to modify or develop a GUI for graphical data entry. c16db can graphically display physical contig maps, but all the data has to be entered by hand, and there is a lot of data. What this means is that every 2 weeks or so, some poor grad student has to sit down and slave away at the keyboard for a day doing tedious data entry to bring the database up to speed. What Dr. Kim (and others; this kind of database is used all over the world in human and other genome mapping projects) would like to be able to do, as far as the GUI goes, is color code the clones based on their characteristics, create new entries by clicking and dragging to draw the clone, and dynamically rearrange the clones onscreen (with the appropriate updates in the database when you move something). Most of the information they have on these clones isn't exact, so they would like to be able to enter it into the database in a similarly fuzzy way. They have also been thinking it would be nice if the data could be output directly (graphically) over the web. We don't have to deal with the actual (mathematically messy) problem of trying to fit this huge linear jigsaw puzzle back together. The NIH has many computers dedicated to only that, with gigabytes of memory, cranking away day and night.

Glen George has suggested a couple of ways that we might go about doing this project: either rewriting the existing code in C++, adding features and expansion capability as we go along, or building a smaller suite of helper programs that store their own data in their own format, read in the c16db data, and deal with display. But because we are going to have to dynamically alter the database, this second option might turn out to be just as complex as the first. The code that currently exists is written in C and is built (apparently) to deal with Mac and PC platforms too (at least some of the display code is). Dr. Kim claims that the code is already somewhat object oriented, as is the database, so rewriting might not be as crazy as it sounds (there are, according to Matt Doucleff, about 150,000 lines of code already written). Unfortunately, with the vast amount of code we already have, I suspect that the program is doing a lot that we don't really understand yet. If anyone has any insight into what that might be, it would be great to hear from you.



Inputs and Outputs:

Much of our work involving inputs and outputs is defining exactly how the user will talk to the system. The User Interface specification covers that in detail. However, there are many programming issues involved with handling the input of data before that input is interpreted, reading data files, updating the screen, and handling the final data output.

The input and output specifications detail our approach to these problems.

As far as program design goes, most of the functionality described here will be spread throughout the system. The database subsystem will know how to handle its input and output files, and the GUI system will deal with interactive events. It is only convenient to consider these topics as a separate system because they are logically related.

Inputs

Inputs to the program come from two sources: the User Interface and the genome database files (see Database Requirements). The inputs are categorized into two groups:

Interactive Inputs: Interactive user input will be entirely through the GUI, as described in the User Interface specification.

These operations require both keyboard and mouse input, allowing the use of GUI objects such as buttons, menu bars, and scrollbars, as well as hot keys to zoom, scroll, select various display objects, and activate various program functions.

Event Handling

All screen objects will be drawn using the standard X Toolkit widgets. This was an important design decision with implications for data input, and we realize it is not without its drawbacks. The decision to use the X Toolkit restricts our interactive input handling options to those provided by the Toolkit. However, what we lose in terms of flexibility we gain in pre-fabricated event handling. Free implementations of the standard X libraries are available.

The inputs must be handled using the default X Toolkit input mechanism, which typically involves the registration of function callbacks and "actions" to respond to particular types of input events. Each widget defines its own set of automatic responses when it receives an event, eliminating much of our work. Although the Toolkit's widget hierarchy provides many different generic widget types, such as scrollbars and buttons, we may find it useful to define our own widgets in order to get the various behaviors we need.
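The registration pattern can be illustrated without the Toolkit itself. The sketch below is plain C++ and deliberately makes no X Toolkit calls; EventDispatcher, registerCallback, dispatch, and the event names are hypothetical stand-ins for the callback mechanism the Toolkit provides, not its actual API.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a widget's callback list; this is NOT the Xt API.
class EventDispatcher {
public:
    using Callback = std::function<void(int x, int y)>;

    // Register a function to run whenever the named event arrives.
    void registerCallback(const std::string& eventName, Callback cb) {
        callbacks_[eventName].push_back(std::move(cb));
    }

    // Deliver an event to every callback registered for it.
    // Returns the number of callbacks that ran.
    int dispatch(const std::string& eventName, int x, int y) const {
        auto it = callbacks_.find(eventName);
        if (it == callbacks_.end()) return 0;
        for (const Callback& cb : it->second) cb(x, y);
        return static_cast<int>(it->second.size());
    }

private:
    std::map<std::string, std::vector<Callback>> callbacks_;
};
```

The point of the pattern is that the code that draws an object never polls for input; it only states, once, which function should run for which event.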

Mouse Events

Typical X workstation mice have three buttons. These buttons provide a convenient method of selecting multiple operations for a given object on the screen. In addition, the program will implement the default X Toolkit mouse "dragging" handlers. The User Interface specification will detail what the various mouse clicks and drags will do.

Outputs

Outputs from the program go to two destinations: the display and the database files. For more information on these, read the User Interface specification and the Database Requirements spec. The outputs are categorized into two groups:

Screen Display: Much of the work for this project involves creating a useful user interface. We plan on using the X Toolkit exclusively for all of our graphics. This ensures that our program will run on nearly any X workstation, and also greatly reduces the amount of coding we'll need to do. Much of the drawing will be handled by widgets such as buttons, menus, etc., which the X Toolkit takes care of for us. The other graphics, such as the genetic map, will require our own design. Even then we'll rely upon the utilities provided by X to do most of the drawing for us. We may choose to define new X widgets for objects like clones and STS sites. This would allow us to encapsulate both drawing and event handling into the same object in the code.

These are most of the details we need to consider for the display. The X event handling mechanism handles "house-keeping" events such as window hiding and resizing for us.

File Outputs: Initially, our program reads in two database files as described in the inputs section. The user then modifies this data and adds any new data via the User Interface.

The database is in charge of keeping track of the information as it is created or modified. When everything is finished, the data must be written back out to files. These files will be in the same format as those read (ACE database format), containing the modified information of course.



User Interface

General

The user interface will consist of one X-window which will take both mouse and text input, and give the user text and graphical output.

Object Selection

To select an object, the user left-clicks on it. To display all relevant information about the object, the user right-clicks on it.

Left/Right fields

In the lower right-hand corner of the window there will be two small text fields, one displaying the left coordinate of the currently selected object and one displaying the right coordinate. There will be increment and decrement buttons associated with each of the text fields, allowing the user to change the value of the left or right coordinate of the selected object directly. The user can also just delete the current number and enter a new one. For STS markers and other one-coordinate objects, both left and right will display the same number and be linked (i.e., you can't change the value in one without changing the value in the other as well). The files containing the left and right endpoints of the objects are in: file://date.tree.caltech.edu/data0/genome/c16db/rawdata/c16db-all*
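The linking rule for one-coordinate objects could be kept in the object itself rather than in the GUI code. A minimal sketch, assuming a hypothetical Extent type of our own (none of these names come from c16db):

```cpp
// Coordinates are in ACeDB units. Hypothetical type for illustration.
struct Extent {
    double left;
    double right;
    bool singleCoordinate;  // true for STS markers and other point objects

    // Change the left field; point objects keep both fields identical.
    void setLeft(double value) {
        left = value;
        if (singleCoordinate) right = value;
    }

    // Change the right field; point objects keep both fields identical.
    void setRight(double value) {
        right = value;
        if (singleCoordinate) left = value;
    }

    // Increment/decrement buttons: nudge one field by a step (may be negative).
    void nudgeLeft(double step)  { setLeft(left + step); }
    void nudgeRight(double step) { setRight(right + step); }
};
```

With this arrangement the text fields and the increment/decrement buttons all go through the same two setters, so the two fields of a marker cannot drift apart.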

Chromosome Display

The top portion of the window will contain a graphical representation of the chromosome being mapped, arranged horizontally (similar to the c16db graphic). This graphic will display the banding of the entire chromosome, and the locations of the centromere and telomeres. At the very top of the window will be a ruler giving the coordinates in ACeDB units. The different bands in the chromosome will be labeled.

Zoom/Scroll Bar

Below the graphic of the chromosome there will be a scroll/zoom bar. If you click on the endpoints of the bar, you can stretch it to make the zoom field bigger or smaller. If you click in the middle of the bar, you can drag it to the left and to the right, to change the part of the chromosome that you are viewing. You can also use the left/right fields in the lower right corner to change the zoom factor and displayed location when the zoom/scroll bar is the selected object.
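The zoom/scroll bar amounts to a window over the chromosome's coordinate range. The sketch below shows the arithmetic under our own assumptions: View, toPixel, pan, and zoom are hypothetical names, and zooming is assumed to keep the center of the view fixed.

```cpp
// Hypothetical view window over the chromosome, in ACeDB units.
struct View {
    double left;
    double right;

    double width() const { return right - left; }

    // Map a chromosome coordinate to a horizontal pixel position in a
    // window that is widthPx pixels wide (fractional; the caller rounds).
    double toPixel(double coord, int widthPx) const {
        return (coord - left) / width() * widthPx;
    }

    // Dragging the middle of the bar: shift the window without resizing it.
    void pan(double delta) {
        left += delta;
        right += delta;
    }

    // Zoom in by a factor (2, 4, 8, ...) while keeping the center fixed.
    void zoom(double factor) {
        double center = (left + right) / 2.0;
        double half = width() / (2.0 * factor);
        left = center - half;
        right = center + half;
    }
};
```

Stretching an endpoint of the bar would simply assign a new left or right value; every display below the bar can then be redrawn through toPixel.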

STS and other marker sites

Below the zoom/scroll bar there will be another display of the chromosome, showing only the section that is currently zoomed, and the labels of the bands that show up there. Below the zoomed chromosome graphic, there will be a ruler in ACeDB units, scaled to the zoomed area. On that ruler will be the STS and other marker sites, demarcated by a small triangle and the catalog number of that marker. If you select the STS site, you can alter its coordinates just like all the other objects.

Clone Display

The remaining portion of the window will be taken up by the actual display of the clones as lines. They too will be displayed horizontally, each with its associated catalog number, but only when you are zoomed in close enough for there to be room for the display of the catalog number (so you don't get alphabet soup when you're zoomed all the way out). When you select a clone, you can alter its left/right coordinates in the text boxes at lower right. The clones are color coded in 3 distinct colors: one for completely sequenced, one for "shotgun" clones (partially sequenced), and one for unsequenced. If the ends of the clones are sequenced, they show up as a "bright" color; otherwise they show up grey. Ideally, we would like to have the tool look at the sequences of the ends of the clones and align them itself, but this is not planned in the current implementation.
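One possible test for "zoomed in close enough" is to compare the clone's width on screen with the width of its label. This is only a sketch: labelFits is a hypothetical name, and the fixed 8-pixel character width is an assumption of ours rather than anything the X fonts guarantee.

```cpp
#include <string>

// Decide whether a clone's catalog number fits on its line at the current zoom.
// pixelsPerUnit: screen pixels per ACeDB unit at the current zoom level.
// charWidthPx: assumed fixed width of one label character (hypothetical value).
bool labelFits(const std::string& catalogNumber,
               double left, double right,
               double pixelsPerUnit, int charWidthPx = 8) {
    double cloneWidthPx = (right - left) * pixelsPerUnit;
    double labelWidthPx = static_cast<double>(catalogNumber.size()) * charWidthPx;
    return cloneWidthPx >= labelWidthPx;
}
```

A real implementation would ask the font for the label's width instead of multiplying by a constant.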

Query box

In the lower left-hand corner of the display, there is a query box, into which the user can type a catalog number to search for. If the object is found, it is highlighted and the display centers on it (if it is a clone, the display also zooms in far enough that the catalog number is shown, if it wasn't already). If the catalog number is not found, a dialog box pops up ("Unable to find XXXXXXXX, please check the number and try again.").
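The query logic itself is small. In the sketch below, CatalogIndex, QueryResult, and runQuery are hypothetical names, and a std::map stands in for the real database lookup; only the wording of the error message comes from this specification.

```cpp
#include <map>
#include <string>

// Hypothetical index from catalog number to the object's center position
// (ACeDB units). The real tool would look the object up in the database.
using CatalogIndex = std::map<std::string, double>;

// Result of a query: either the position to center on, or a dialog message.
struct QueryResult {
    bool found;
    double centerOn;      // valid only when found is true
    std::string message;  // dialog text when found is false
};

QueryResult runQuery(const CatalogIndex& index, const std::string& catalogNumber) {
    auto it = index.find(catalogNumber);
    if (it == index.end()) {
        return {false, 0.0,
                "Unable to find " + catalogNumber +
                ", please check the number and try again."};
    }
    return {true, it->second, ""};
}
```

The GUI would then either center (and, for clones, zoom) on centerOn or show the message in a dialog box.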

Pop-up menus

If you click the right button, you get a pop up menu. In it we have:

  • Printing
  • Zoom (choose from predefined generic zoom: x2, x4, x8)
  • Create/Delete objects.
    • clones
    • STS markers
  • Save
  • we suspect that there will be other items in the pop-up menus, but don't know what they are just yet.
When creating a clone, the user is given a dialog box that contains a form to fill out. The fields in this form are:
  • clone name
  • left/right endpoints
  • sequencing progress (done, in progress, or none)
  • remarks
  • other fields as yet to be determined
When you Save, the changes that have been tentatively made with the tool are written to the database.

Printing gives color output (if possible) and prints only the chromosome graphic and clone view sections of the display.



Error Handling:

Errors may occur at almost all stages of the program. Programmer's errors will be handled by extensive assertions in the code. Below we consider errors at the user interface and file interface and discuss our handling methods.

  1. To protect the integrity of the database, the program will only allow authorized persons to modify information in the database, i.e., only those people who have an entry in the passwd.wrm file will have write access. When an unauthorized user attempts to modify the database, a window containing "no authorization" information will appear and the program will not perform the operation.

  2. To keep the integrity of the database, our program should also check at the outset that only one user is working on the database at a time. If another user is already working on the database, the program should give the current user a warning and terminate.

  3. For modification or search purposes, if the user inputs (1) a clone or marker name, (2) a start or end (i.e., left or right) position, (3) a center position, (4) a clone size, or (5) remarks via the user interface, each input must match the data type specified in our code. For example, a clone or marker name must be a string starting with a letter, and a left, right, or center position or a clone size must be a floating-point number (in ACeDB units, i.e., 1 kbp). If an input error is detected, a dialog box will pop up and give the user the choices "Retry" or "Cancel".

  4. The left, right, or center position for a clone or marker must be a floating-point number within a specified range for the chromosome in question. For example, the range is from -40,000 to 60,000 (ACeDB units) for chromosome 16. If a left, right, or center position entered by the user or used in a search is outside this range, an error message will be displayed and the user will have to choose between "Retry" and "Cancel".

  5. When a user attempts to resize a clone bar on screen, the program will check the database to see whether the specific BAC clone is resizeable. If it is not resizeable (i.e., the clone has already been completely sequenced and its bar is colored red), the user will be informed that it cannot be resized, even though it may still be movable.

  6. During the addition of new clones, if the clone spans markers that do not belong to the clone, the user should be reminded that either the markers' positions may be incorrect or the unsequenced part of that clone does contain these markers. The user will then be given three choices:
    • move the markers around to change their positions
    • resize the clone bar
    • leave the positions for markers and clone as where they are now.

  7. If the user clicks on the dull-colored segment of a shotgun clone, which means the end sequence data is not found in dump.ace, he or she will be informed that no sequence data is available for the clicked clone segment.

  8. If the user clicks on an object on the screen with no default action, a warning will be issued. Any other default mouse errors will be handled by invoking the standard X libraries.

  9. Illegal keystrokes will be handled by the standard X Toolkit.

  10. If any file open or close error occurs during the program, it will be handled by calling the standard file I/O library functions.

  11. If any dynamic memory allocation error occurs, the program will terminate.
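The input checks in items 3 and 4 can be sketched as two small functions. The range constants are the chromosome 16 values quoted in item 4; the function names isValidName and parsePosition are hypothetical, and showing the "Retry or Cancel" dialog is left to the caller.

```cpp
#include <cstdlib>
#include <cctype>
#include <string>

// Range quoted in item 4 for chromosome 16, in ACeDB units.
const double kMinPosition = -40000.0;
const double kMaxPosition = 60000.0;

// A clone or marker name must be a non-empty string starting with a letter.
bool isValidName(const std::string& text) {
    return !text.empty() &&
           std::isalpha(static_cast<unsigned char>(text[0])) != 0;
}

// Parse a position typed by the user. Returns true and stores the value
// only if the whole string is a number inside the allowed range.
bool parsePosition(const std::string& text, double& value) {
    if (text.empty()) return false;
    char* end = nullptr;
    double parsed = std::strtod(text.c_str(), &end);
    if (end == text.c_str() || *end != '\0') return false;  // not a number
    // Written as a positive test so that NaN is rejected as well.
    if (!(parsed >= kMinPosition && parsed <= kMaxPosition)) return false;
    value = parsed;
    return true;
}
```

On a false return the GUI would pop up the dialog and leave the stored value untouched, since value is only written on success.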



Database Requirements:

The Database consists of two separate lists of genetic objects. A dynamic list of STS Markers will be maintained (read from .ace files). The STS Marker list will be searchable by position and name. The STS Marker will store information regarding name, center position, size, color (for output) and any ACE database remark fields.

A dynamic list of BAC Clones will be maintained (read from .ace files). The BAC Clones list will be searchable by position and name. The BAC Clone will store information regarding name, center position, size, re-sizeability (yes/no), color (for output), markers contained in the clone, in situ position, and ACE database remark fields. Both the BAC Clone list and the STS Marker list will be editable (in place). Searches will return a database object (a list of objects) with reference semantics. The Database will be stored in ACE files. A description of the ACE database file format is in Appendix A (or see: http://probe.nalusda.gov:8000/acedocs/syntax.html )

Objects

  • Markers
    Name
    Center Position
    Size
    Color
    Ace database fields (remarks)

  • Clones
    Name
    Center Position
    Size
    Re-sizeable (yes/no)
    Color
    Ace database fields (remarks)
    Marker(s) contained in the clone
    In-situ position

Access

Random access by name or position.
Position/Color and Size of Clones editable
Sorted by position.
Searchable by Marker
Searches return Databases themselves.
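One way to get "searches return Databases" together with reference semantics is to have every database hold shared pointers to the same objects. The sketch below assumes this design; GeneticObject, Database, and the method names are hypothetical, and the linear scans are kept only for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Minimal stand-in for a marker or clone record (hypothetical fields only).
struct GeneticObject {
    std::string name;
    double center;  // ACeDB units
    double size;
};

// A database is a list of shared objects kept sorted by position.
// Search results are databases too and share the same objects, so an
// edit made through a result is visible in the original.
class Database {
public:
    void add(std::shared_ptr<GeneticObject> obj) {
        auto pos = std::upper_bound(
            items_.begin(), items_.end(), obj,
            [](const std::shared_ptr<GeneticObject>& a,
               const std::shared_ptr<GeneticObject>& b) {
                return a->center < b->center;
            });
        items_.insert(pos, std::move(obj));
    }

    // All objects whose center lies in [lo, hi].
    Database findByPosition(double lo, double hi) const {
        Database result;
        for (const auto& obj : items_) {
            if (obj->center >= lo && obj->center <= hi) result.items_.push_back(obj);
        }
        return result;
    }

    // All objects with exactly this name.
    Database findByName(const std::string& name) const {
        Database result;
        for (const auto& obj : items_) {
            if (obj->name == name) result.items_.push_back(obj);
        }
        return result;
    }

    std::size_t size() const { return items_.size(); }
    GeneticObject& at(std::size_t i) const { return *items_.at(i); }

private:
    std::vector<std::shared_ptr<GeneticObject>> items_;
};
```

Note that editing an object's center through a search result leaves the original list's sort order stale; a fuller version would re-sort (or remove and re-add the object) after a position edit.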



Algorithms:

The only algorithms we would implement are those for sorting and searching the database, in order to find the information needed or save the modified information. The clone name is in the form "A-###B##", where "A" and "B" stand for any letter and "###" and "##" are a three-digit and a two-digit number, respectively. We need to sort the database in alphabetical order of clone names, which can be done with the mergesort algorithm. Once the list is sorted, a search by clone name can be done quickly by binary search.
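One way to realize this is to sort the names once and then look names up by binary search. The sketch below uses the C++ standard library rather than a hand-written sort: std::stable_sort is typically implemented as a merge sort, and std::lower_bound performs the binary search. The function names are our own.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Sort clone names alphabetically. std::stable_sort is typically
// implemented as a merge sort.
void sortCloneNames(std::vector<std::string>& names) {
    std::stable_sort(names.begin(), names.end());
}

// Binary search in a sorted list. Returns the index of the name,
// or -1 if it is not present.
int findCloneName(const std::vector<std::string>& sortedNames,
                  const std::string& name) {
    auto it = std::lower_bound(sortedNames.begin(), sortedNames.end(), name);
    if (it == sortedNames.end() || *it != name) return -1;
    return static_cast<int>(it - sortedNames.begin());
}
```

Sorting costs O(n log n) once; each lookup afterwards takes O(log n) comparisons, as long as new clones are inserted in sorted position.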



Data Structures:

There are basically two major objects in our code: clones (we will focus on BAC clones and ignore YAC clones) and markers (genetic markers, STS markers, EST markers, etc.), with the goal of matching BAC clones to markers. They are implemented as classes in C++.

The data members for BAC clone class would include:

  1. clone name (catalog number)
  2. left end position
  3. right end position
  4. center position (which can be deduced from (2) and (3))
  5. clone size (this can also be deduced from (2) and (3), but it is usually determined by gel electrophoresis)
  6. resizeability (if clone size is known either by gel electrophoresis or by complete sequencing as parsed from the database remark field, it is unresizeable; otherwise, resizeable)
  7. color (if completely sequenced, red; if shotgun clone, blue; if on hold, green)
  8. remarks
  9. positive STS markers
  10. In situ mapping (FISH) position
  11. clone's DNA sequence data *

The data members for marker class would include:

  1. marker name
  2. center position
  3. marker size (since marker is usually several hundred basepairs, it is convenient to use center position and size instead of left and right position).
  4. color (depending on sequence methods, laboratories, etc.)
  5. marker's DNA sequence data *

* The DNA sequence data for either clones or markers will simply be in the form of one-dimensional array. The array size is determined by the sequence length. If the DNA is not sequenced or not completely sequenced, the array size is set to MAX_INT (even though the approximate sequence length can be deduced from gel electrophoresis).
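The two member lists translate fairly directly into C++. The sketch below is one possible layout, not a final design: all type and member names are illustrative, center and size are derived from the endpoints rather than stored, and we have swapped in a std::string for the sequence data in place of the fixed-size array described in the footnote.

```cpp
#include <string>
#include <vector>

enum class SequencingStatus { Complete, Shotgun, OnHold };
enum class Color { Red, Blue, Green };

// Color rule from the clone member list: red if completely sequenced,
// blue for shotgun clones, green for clones on hold.
Color colorFor(SequencingStatus status) {
    switch (status) {
        case SequencingStatus::Complete: return Color::Red;
        case SequencingStatus::Shotgun:  return Color::Blue;
        case SequencingStatus::OnHold:   return Color::Green;
    }
    return Color::Green;  // not reached; keeps compilers quiet
}

struct BacClone {
    std::string name;                          // catalog number
    double left = 0.0;                         // ACeDB units
    double right = 0.0;
    SequencingStatus status = SequencingStatus::OnHold;
    bool sizeKnown = false;                    // gel electrophoresis or complete sequencing
    std::string remarks;
    std::vector<std::string> positiveMarkers;  // names of positive STS markers
    double fishPosition = 0.0;                 // in situ mapping (FISH) position
    std::string sequence;                      // empty when nothing is sequenced

    double center() const { return (left + right) / 2.0; }
    double size() const { return right - left; }
    bool resizeable() const { return !sizeKnown; }
    Color color() const { return colorFor(status); }
};

struct Marker {
    std::string name;
    double center = 0.0;         // ACeDB units
    double size = 0.0;
    Color color = Color::Blue;   // set per method or laboratory; default is arbitrary
    std::string sequence;

    double left() const { return center - size / 2.0; }
    double right() const { return center + size / 2.0; }
};
```

Deriving center and size from the endpoints avoids keeping four numbers in sync; if the gel-measured size must be stored separately from the mapped extent, it would become its own member.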



Limitations:

Our program has the following limitations:

  1. Our program does not automatically find the positions at which to put clones by matching and aligning the known DNA sequence of those clones against the known DNA sequence of the chromosome. However, we can envision overcoming this by coupling our program with commonly used database search programs (such as FASTA or BLAST - Basic Local Alignment Search Tool).

  2. Our program does not deal very much with Internet applications. These include publishing our genomic database through a public Internet interface so that scientists all over the world can search or display genomic data and contig maps from our database.

  3. It does not deal with pictorial and sound information associated with certain DNA sequences, which restriction sites each clone has, which gene each clone belongs to, what protein it would predict, how much evolutionary conservation the DNA sequences of certain genes show from E. coli to C. elegans to man, etc.
