A computer software tool for graphic management of physical mapping data

Lixin Tang1, Jeremy Boulton2, Benjamin Liau2, Hui Zhang3, Wei Qin4, Sung Ha Huh2, Robert Xuequn Xu1, Yicheng Cao1, Glen A. George2, and Ung-Jin Kim1 *

1Division of Biology, 2Computer Science, 3Atmospheric Chemistry, and 4Electrical Engineering, California Institute of Technology, Pasadena, CA 91125

*Corresponding author

Address correspondence to:

Ung-Jin Kim
147-75, Division of Biology
Caltech, Pasadena,
CA 91208

 

keywords:

physical contig map, software tool, ACEDB

ABSTRACT

Despite remarkable progress in physical mapping projects and biocomputing, no fully automated process is yet available for integrated physical mapping, that is, for assembling large contigs of markers and clones in which the linear order of the objects, the contiguities, and the extents of the overlaps between clones are indicated. No current computer program can recognize and resolve conflicts among different types of experimental data, such as STS contents of clones, restriction fingerprint patterns, and the results of various hybridizations, and construct an optimal physical map without human intervention. Construction and drawing of physical contig maps, as well as management of the mapping data, ultimately depend on human decision and elaboration. Frequent updating of complex physical maps and databases, by re-drawing graphic maps and re-entering the data associated with each map object as the map changes, requires a tremendous amount of time-consuming work. To facilitate drawing, updating, and database entry of physical maps, we have developed a software tool that reads the content of an ACEDB database, allows graphic display and freehand editing of the physical maps, and dumps the physical map to a file that can be parsed by ACEDB. The program, written in C++ and called "AceDraw", greatly facilitates physical mapping projects. We demonstrate here the utility of the program in the construction of BAC contigs in human chromosome 16p regions.

INTRODUCTION

Large-scale physical mapping projects invariably require constant organizing, drawing, and updating of graphic contig maps and other associated data. Physical mapping data are organized and viewed most comprehensibly as graphic maps in which the location and order of clones and landmarks, and the overlaps between clones, are displayed along the length of chromosomal regions. A number of databases support management and display of physical mapping data. For instance, ACEDB (http://www.sanger.ac.uk) is a popular database model suited to biological and genomic data management, and it has been modified and adapted for a variety of organisms and purposes in many laboratories. ACEDB provides GUI tools for the display of various types of data, including genetic as well as physical contig maps. Stored markers or clones with associated positional information on chromosomes, clone size, and marker contents are rendered as graphic objects, which are in turn linked to other related data that can be retrieved in popup windows. Data can be entered, modified, and updated either manually on individual objects using the ACEDB editing tools, or by creating a file containing data for multiple objects in the ACEDB-readable format (.ace file) and loading it with the ACEDB parser. However, the ACEDB graphic display does not allow physical mapping data to be drawn or modified directly. Map objects can only be created, modified, and moved by typing in new objects or changing their parameters in textual form. This makes daily updating of physical maps far slower than map drawing and modification through intuitive graphic drawing tools.

Building contig maps involves analysis of a variety of data associated with map objects such as clones and markers. Software tools are available for the automated determination of clone overlaps and contig assembly based on restriction fingerprint patterns (ref), STS contents (ref), and sequence matches (ref). However, human intervention is always required to resolve conflicts in the contig assembly. Currently no fully automated process permits map construction by integrating all of these mapping data without human intervention. In addition to the physical characteristics of the clones, other available biological and mapping data for the map objects, such as FISH mapping and analysis, clone-to-clone hybridization, gene contents, and other annotated information, must be considered in drawing a finalized physical contig map. An accurate, integrated map can only be obtained by human judgement and manual drawing. For manual drawing of the maps, convenient freehand drawing software tools are of great help. For this purpose we have been using MacDraw Pro, a popular commercial program for freehand drawing. However, maps drawn in such programs cannot be ported to and from biological databases such as ACEDB, and the map objects are not searchable because they are not linked to a database.

To overcome these shortcomings, we have developed a software tool called "AceDraw". As summarized in Figure 1, "AceDraw" has a bidirectional connection to the ACEDB database. It parses mapping data from ACEDB and draws more presentable and intuitive contig maps using a large number of colors, each assigned color being associated with certain traits of the underlying data. After drawing, modification, and updating, it saves the mapping data back into ACEDB format, so that the special functionalities provided by ACEDB remain available. More importantly, "AceDraw" is a freehand drawing tool that enables easy human intervention to resolve conflicts in the contig map through direct manipulation of clones and markers, such as moving, resizing, and color changing.

"AceDraw" also allows easy creation, modification, and deletion of clones and landmarks without low-level editing of database files, and thus facilitates map construction. Because "AceDraw" is backed by a relational database, the database can easily be queried for an object of interest. Moreover, "AceDraw" supports map output to high-resolution PostScript files for colorful, detailed, and effective presentation.

RESULTS AND DISCUSSION

1. Overview of the software (Figure 1)

(1) Parsing ACEDB database files

The genomic data for "AceDraw" can be obtained from two sources:

(i) The object creation function in "AceDraw"

(ii) The ACEDB database file parsing function in "AceDraw"
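
As an illustration of the parsing route, the sketch below reads a small fragment in the .ace text format, in which each object is a paragraph that begins with a `Class : "name"` line and continues with tag lines. It is written in Python for brevity and is not the AceDraw C++ parser. The function name `parse_ace`, the sample objects, and the tag layout (`Map`, `Left`, `Right`, `Position`) are assumptions made for the example; the real tags depend on the ACEDB schema in use.

```python
import shlex

def parse_ace(text):
    """Parse .ace-style paragraphs into a list of dicts (illustrative only)."""
    objects = []
    for paragraph in text.strip().split("\n\n"):
        # Drop empty lines and '//' comment lines.
        lines = [ln for ln in paragraph.splitlines()
                 if ln.strip() and not ln.lstrip().startswith("//")]
        if not lines:
            continue
        # Header line looks like:  Clone : "B123A4"
        cls, _, name = lines[0].partition(":")
        obj = {"class": cls.strip(), "name": name.strip().strip('"'), "tags": []}
        for ln in lines[1:]:
            # shlex keeps a quoted value such as "Chr16" together as one token.
            obj["tags"].append(shlex.split(ln))
        objects.append(obj)
    return objects

# Hypothetical sample data, not taken from the chromosome 16 database.
SAMPLE = '''
Clone : "B123A4"
Map "Chr16" Left 100.0 Right 250.0

STS : "D16S1"
Map "Chr16" Position 120.0
'''

objects = parse_ace(SAMPLE)
```

Here `objects` holds one dictionary per paragraph, with the class, the object name, and each tag line split into tokens.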

(2) mySQL relational database to store genomic data

All genomic data from the above two sources are stored in a back-end mySQL database server. mySQL is freely available software. It is a client-server implementation consisting of a server daemon, mysqld, and many different client programs and libraries. It supports multiple users and multiple threads, with the popular SQL as the underlying database language.
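
A relational layout for clones and markers might look like the following sketch. It uses Python's built-in SQLite module as a stand-in for the mySQL server so that it runs without a database installation. The table names, column names, and the helper `move_clone` are hypothetical and do not reproduce the schema actually used by "AceDraw".

```python
import sqlite3

# In-memory SQLite stands in for the mySQL server (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clone  (name TEXT PRIMARY KEY, left_pos REAL, right_pos REAL);
CREATE TABLE marker (name TEXT PRIMARY KEY, pos REAL);
CREATE TABLE clone_marker (
    clone  TEXT REFERENCES clone(name),
    marker TEXT REFERENCES marker(name)
);
""")
conn.execute("INSERT INTO clone VALUES (?, ?, ?)", ("B123A4", 100.0, 250.0))
conn.execute("INSERT INTO marker VALUES (?, ?)", ("D16S1", 120.0))
conn.execute("INSERT INTO clone_marker VALUES (?, ?)", ("B123A4", "D16S1"))

def move_clone(conn, name, delta):
    """Shift a clone along the chromosome, as a drag in the GUI might request."""
    conn.execute(
        "UPDATE clone SET left_pos = left_pos + ?, right_pos = right_pos + ? "
        "WHERE name = ?",
        (delta, delta, name),
    )

move_clone(conn, "B123A4", 10.0)
row = conn.execute(
    "SELECT left_pos, right_pos FROM clone WHERE name = ?", ("B123A4",)
).fetchone()
```

Keeping marker content in a separate link table lets one marker belong to several overlapping clones, which is the usual situation in a contig.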

(3) Graphical representation of genomic objects (Figure 3)

In "AceDraw", STS sites and clones are represented as colored triangles and colored bars, respectively, aligned horizontally along the ruler and the chromosomal representation of the region in focus. Bars representing end-sequenced clones have the sequenced end(s) colored darker, while bars for unsequenced and fully sequenced clones are drawn in a single color. The region in focus may be changed by moving or resizing the locator, and the window zoom may be changed at will. The name of each STS site is displayed at the tip of its triangle, while the name of each clone is displayed in the middle of the clone bar if the bar is long enough to hold it. The primary information for a visible STS marker or clone (i.e. its name and positional coordinates on the chromosome in ACEDB units) is displayed in special boxes on a mouse click. The associated data (such as positional information, clone size, and marker contents) can be retrieved in popup windows by double-clicking the graphical object (STS site, clone, or chromosome band). Changes or deletions may be made through these popup windows.
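
The mapping from chromosomal coordinates to window positions, and the rule that a clone name is shown only when its bar is long enough, can be sketched as follows. This is a minimal Python illustration under assumed conventions (a linear scale and a fixed character width of 7 pixels); the function names are hypothetical and the actual "AceDraw" drawing code is not shown in this paper.

```python
def to_pixel(pos, view_start, view_end, width_px):
    """Map a chromosomal coordinate to a horizontal pixel offset in the window."""
    scale = width_px / (view_end - view_start)  # pixels per map unit
    return (pos - view_start) * scale

def label_fits(name, left, right, view_start, view_end, width_px, char_px=7):
    """True if the clone name fits inside its bar at the current zoom."""
    bar_px = (to_pixel(right, view_start, view_end, width_px)
              - to_pixel(left, view_start, view_end, width_px))
    return len(name) * char_px <= bar_px

# A clone spanning 100..250 map units, viewed in an 800-pixel-wide window.
x = to_pixel(150.0, 100.0, 300.0, 800)                          # zoomed in
zoomed_in = label_fits("B123A4", 100.0, 250.0, 100.0, 300.0, 800)
zoomed_out = label_fits("B123A4", 100.0, 250.0, 0.0, 30000.0, 800)
```

With the view restricted to 100..300 units the bar is 600 pixels wide and the name fits; with the view widened to 0..30000 units the bar shrinks to about 4 pixels and the name is suppressed.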

(4) Freehand drawing of genomic objects (STS sites, clones)

The major goal of "AceDraw" is to facilitate human manipulation of graphical objects while preserving data integrity as far as possible. Graphical objects representing STS sites and clones can be moved along the chromosome; the program checks for any change in the relative positions of clones or STS sites and asks for confirmation if such a change is detected. Clone bars (except those of fully sequenced clones) may be resized after confirmation through the appropriate dialogs.
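
One way to detect that a move alters the relative order of map objects is to compare the left-to-right ordering before and after the proposed move. The Python sketch below illustrates the idea for STS positions; the function names and sample markers are hypothetical, and the check performed by "AceDraw" itself may differ in detail.

```python
def order_of(objects):
    """Names sorted by coordinate; ties are broken by name for determinism."""
    return [name for name, pos in sorted(objects.items(),
                                         key=lambda kv: (kv[1], kv[0]))]

def move_changes_order(objects, name, new_pos):
    """Return True if moving `name` to `new_pos` alters the left-to-right order."""
    before = order_of(objects)
    after = order_of({**objects, name: new_pos})
    return before != after

positions = {"D16S1": 120.0, "D16S2": 180.0, "D16S3": 240.0}

small_move = move_changes_order(positions, "D16S1", 150.0)  # stays left of D16S2
large_move = move_changes_order(positions, "D16S1", 200.0)  # jumps past D16S2
```

A move that returns `True` would be the point at which the user is asked to confirm the change.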

(5) Colors in genomic objects

"AceDraw" provides an array of colors from which the user can select and assign to selected graphical objects. The color assignment feature links designated colors to particular properties of genomic objects, and thus gives a clear visual indication of those properties.

(6) PostScript file output

"AceDraw" supports saving the contig map to PostScript files for effective presentation. The resolution, output range, and other parameters may be specified from a menu. The PostScript files contain the aligned chromosome bands, the ruler calibration, and the colored STS sites and clones.

(7) Database search

Unlike other freehand drawing programs such as MacDraw Pro, map objects in "AceDraw" are linked to a database and are therefore searchable. There are two types of objects in "AceDraw" databases: "visible" objects, which have assigned positions on the chromosome, and "invisible" objects, which do not. Both types may be searched through query windows. When an existing "visible" object is found in the database, it is centered on screen and the information associated with it is displayed in a popup window, through which modifications may be made.
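
The distinction between "visible" and "invisible" objects can be modeled by leaving the position columns empty for objects that have not been placed. The Python sketch below uses an in-memory SQLite database in place of mySQL; the table layout, the sample clones, and the function `find_clone` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clone (name TEXT PRIMARY KEY, left_pos REAL, right_pos REAL)"
)
conn.executemany("INSERT INTO clone VALUES (?, ?, ?)", [
    ("B123A4", 100.0, 250.0),  # "visible": has a chromosomal position
    ("B987C6", None, None),    # "invisible": position not yet assigned
])

def find_clone(conn, name):
    """Return (visible, center) for a clone, or None if it is not in the database."""
    row = conn.execute(
        "SELECT left_pos, right_pos FROM clone WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        return None
    left, right = row
    if left is None or right is None:
        return (False, None)
    return (True, (left + right) / 2)  # coordinate on which to center the view

placed = find_clone(conn, "B123A4")
unplaced = find_clone(conn, "B987C6")
missing = find_clone(conn, "no_such_clone")
```

For a visible object the returned midpoint is the coordinate on which the display would be centered; an invisible object can still be found and edited, but has nothing to center on.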

(8) Dumping back into ACEDB database files

"AceDraw" was designed to interact bidirectionally with the ACEDB package. After map drawing, modification, and updating in "AceDraw", the map data can be saved back into database files in ACEDB format so that they can be read back into ACEDB. This allows users to make use of the other functionalities provided by the ACEDB program.

2. Writing the software

"AceDraw" is written in the C++ programming language using an object-oriented approach. It runs in the X window environment on several versions of UNIX and Linux. The genomic data for the program are stored in a freely available mySQL database server package using relational database techniques ("http://www.mysql.org/"). "AceDraw" is a front-end graphical user interface (GUI) and mySQL client that sends SQL queries to the mySQL database server to add, delete, and update data in response to the user's manipulation of graphical objects.

"AceDraw" makes use of the recently developed graphics library "gtk+" (the GIMP Toolkit), the general-purpose utility library "glib", and the "gtk--" library, which provides a convenient C++ interface to the gtk+ graphics routines (Figure 2). All three of the necessary libraries are freely available at http://www.gtk.org/.

"AceDraw" also provides a C++ wrapper of the mySQL C API to facilitate interaction between graphics objects and mySQL database objects within "AceDraw" (Figure 2).

3. Using the software (also refer to http://www.ugcs.caltech.edu/~genome/)

(1) System requirements

"AceDraw" has been developed and run successfully on a Sun workstation running Solaris 2.5.1 (Sparc) and on an Intel PC running Linux. IRIX/SGI was also tested at an early stage of the project and is expected to work as well. In theory, the program runs on any system that is supported by the necessary libraries.

"AceDraw" has only been compiled with the GNU C compiler (gcc or egcc), and we therefore recommend using that compiler. To compile the necessary libraries with the GNU C compiler, we also recommend installing GNU "make". In addition, the GNU version of m4 is specifically required for compiling the libraries on Sun Solaris.

The mySQL database server may be installed on the same machine that runs "AceDraw" or on a separate server machine. We have used a Sun workstation (Solaris 2.5.1) as the mySQL database server. In theory, the database server can be installed on any of the platforms supported by mySQL.

(2) Getting necessary components

i) graphics libraries (glib, gtk+, and C++ wrapper Gtk--):

The graphics libraries can be downloaded from ftp://www.gtk.org, which also contains the necessary documentation. We suggest using glib version 1.1.2 or above, gtk+ version 1.1.1 or above, and Gtk-- version 0.9.14 or above.

ii) database server (mySQL):

The software for installing the mySQL database server, together with the necessary documentation, may be found at http://www.mysql.org. We suggest using mySQL version 3.21 or above.

iii) "AceDraw" source code and compiling:

The compressed source code and Makefile may be obtained from our anonymous ftp site (ftp://ash.tree.caltech.edu). To compile "AceDraw", uncompress the source code files and issue "make" in the directory, which will invoke the GNU C compiler. It may be necessary to edit "Makefile" to specify where the compiled graphics libraries are located.

(3) Building mySQL database server

It is necessary to set up and run a mySQL database server before running "AceDraw" as a client program. This is a two-step procedure on the server machine: (a) set up the database name, user names, passwords, and privileges; (b) run the mySQL daemon.

There are two ways to carry out the first step: either run the provided script file "mysqldb-config", or issue the appropriate mySQL commands after starting "mysql". The mySQL commands for the latter are easy to use and are documented at http://www.mysql.org/.

Running the mySQL daemon is necessary for the server to receive connections and queries from client machines.

(4) Running "AceDraw"

Use the command line "acedraw -i[host_name/IP#] -u[username] -p[password] -d[database_name]" to start "AceDraw". Environment variables can be used in place of command-line arguments to tell "AceDraw" how to contact the database. For example, with the following lines in the .cshrc file, issuing "acedraw" alone should start the program.

setenv ACEDRAW_DBHOST "IP_ADDRESS_OF_SQL_SERVER"
setenv ACEDRAW_DBUSER "genome"
setenv ACEDRAW_DBPASSWORD "genome"
setenv ACEDRAW_DBNAME "chrom16"
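
The way command-line options and environment variables might be combined can be sketched as follows. This Python illustration uses the option letters and variable names given above, but the precedence rule (an option on the command line overrides the corresponding environment variable) is an assumption rather than documented "AceDraw" behavior, and the host name and the database name "chrom21" are placeholders.

```python
import argparse

def resolve_connection(argv, environ):
    """Command-line options take precedence; environment variables fill the gaps."""
    parser = argparse.ArgumentParser(prog="acedraw")
    parser.add_argument("-i", dest="host", default=environ.get("ACEDRAW_DBHOST"))
    parser.add_argument("-u", dest="user", default=environ.get("ACEDRAW_DBUSER"))
    parser.add_argument("-p", dest="password",
                        default=environ.get("ACEDRAW_DBPASSWORD"))
    parser.add_argument("-d", dest="database", default=environ.get("ACEDRAW_DBNAME"))
    return vars(parser.parse_args(argv))

# Environment as set by the .cshrc lines above, minus the host.
env = {
    "ACEDRAW_DBUSER": "genome",
    "ACEDRAW_DBPASSWORD": "genome",
    "ACEDRAW_DBNAME": "chrom16",
}
settings = resolve_connection(["-idbhost.example.org", "-dchrom21"], env)
```

In this example the host and database name come from the command line, while the user name and password fall back to the environment.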

4. Flexibility

Although "AceDraw" was developed using sample data for human chromosome 16, necessary adjustments have been made so that the program should be able to work with genomic data for other chromosomes without modification.

REFERENCES

Coulson, A., Sulston, J., Brenner, S., and J. Karn. 1986. Toward a physical map of the genome of the nematode Caenorhabditis elegans. Proc. Natl. Acad. Sci. USA 83:7821-7825.

Sulston, J., Mallet, F., Staden, R., Durbin, R., Horsnell, T., and A. Coulson. 1988. Software for genome mapping by fingerprinting techniques. CABIOS 4:125-132.

Sulston, J., Mallet, F., Durbin, R., and T. Horsnell. 1989. Image analysis of restriction enzyme fingerprint autoradiograms. CABIOS 5:101-106.

 

Figure 1 Overview of "AceDraw" and object interactions

 

Figure 2 Libraries used in "AceDraw"

 

Figure 3 The main window screenshot of "AceDraw"

 

Figure 4 ACEDB after reading mapping data from "AceDraw" dump files