DVS

Overview

White Paper

Motivation

One of the most popular version control systems in use today is CVS. Developers often complain of several shortcomings in the design of CVS; for example, files cannot be moved or renamed without losing version metadata. DVS addresses weaknesses in CVS while maintaining backwards compatibility as much as possible.

The DVS Model

The DVS Model improves on CVS by tracking directory metadata in addition to file metadata. Each directory is represented in the repository as a contents file, which contains a list of the files in the directory and their versions. Committing changes to a directory is simply committing a new version of the directory's contents file.

This metadata is sufficient to restore any previous state of a source tree. If a file is removed, the prior version of the directory records the presence of the file. When a file is renamed, the prior version of the directory records the prior name.

Additionally, when a directory is checked out from DVS, everything from that directory down in the file system tree forms a consistent snapshot of that set of data files.

Three types of operations benefit from keeping a record of the sequential changes made to directories. First, consider renaming or moving a file. A rename operation only requires changing the local name in the contents file for the appropriate directory (see "Contents Files" below). A move operation only requires removing the file's entry from one contents file and placing it in another contents file.

Second, consider deletion and restoration of a file. When a file is deleted, it is removed from the contents file but not from the repository. The contents file versions show precisely when the file was deleted. A file can be restored by updating the contents file. There is no data or metadata that is lost or that needs to be located.

Third, consider the restoration of old software configurations. If a file in a branch is deleted in CVS, its status as part of that set may be lost. In contrast, DVS records all configurations -- including which files are present and their versions -- so recreating an old configuration is possible even if files have been moved or deleted.

Another benefit from this model is that it easily accounts for the common development scenario in which different subsystems are concurrently developed, perhaps by different people. DVS makes it easy to use different versions of subsystems as independent entities. For example, each developer could use the latest versions of her own files but the latest stable versions of all of the subsystems on which her project depends.

It is also possible to make other subsystems appear as subdirectories of the current project, explicitly capturing the dependence relation between the projects. CVS has no mechanism for this.

DVS is designed to retain as much backward compatibility with CVS as possible, and the interface presented to the user mimics that of CVS (see "Command-based Interface" below). DVS does not require any migration process to use an existing CVS repository; it can use the repository directly, adding its own features as an overlay on top of CVS. Further, CVS users can still use the repository even after some developers have started using DVS, because DVS only adds functionality; nothing is removed or replaced. DVS' features are, of course, not available to CVS users.

DVS does not use a database other than plain text RCS files, because if for any reason the administrator needs to modify the repository directly, the task is much more difficult for having used a database. Further, DVS is designed with simplicity in mind - rather than being a full software configuration management system, DVS handles versioning of projects and leaves the developer free to use any suitable build tools. DVS can, of course, version any files used to control the build process, such as makefiles.

Related Work

ClearCase

ClearCase integrates version control directly into the file system by providing a custom file system for storing files to be versioned. Working directories are presented as views, which must be customized by the developer to select the desired files and versions to display. All files are kept remotely on the ClearCase server and before a file is modified it is copied to the local machine. When changes are committed they are propagated back to the ClearCase server.

Like DVS, ClearCase versions directories. After modifying files, the containing directory needs to be committed as well so that ClearCase knows which file versions go together.

While the view shows the current version of each file, other versions may be selected explicitly using a hierarchical naming system. For example, test.cc@@/main/bugfix/3 represents version 3 of test.cc on the bugfix branch.

Checking files out typically requires the files to be locked and they are unlocked when the developer either releases them or commits changes. ClearCase also supports the copy-modify-merge process used in systems such as CVS and its derivatives.

Although DVS and ClearCase have similar models, their implementations differ greatly. One drawback to ClearCase is the potentially high network overhead. Files often reside on a remote machine even when being used locally (namely files that are read but not modified). When developers are spread across a large geographic area this can lead to very poor performance, as it requires a great deal of communication between the client and the server.

CVS

CVS is based on RCS which uses plain text files to store data in a repository. CVS versions files but not directories.

One major shortcoming of CVS is that directory version metadata is not captured. When files are deleted, CVS can't reconstruct a prior source tree that included that file because it doesn't record versions of trees. This limits the usefulness of the repository as a whole.

Another shortcoming is that moving or renaming files requires the administrator to manually edit the repository to avoid losing all version history and metadata for the file being moved or renamed. This inhibits project growth and evolution, because very often data files change in use over time or are renamed or moved to a new location. Requiring the administrator to manually edit the repository decreases productivity, but being restricted to a source layout that no longer reflects the organization of the project also decreases productivity.

Perforce

Like CVS, Perforce uses RCS files for storing revision updates. However, it uses a different mechanism than RCS for branching [PFIFB]. Perforce does not version directories but has some other means for restoring previous directory states.

The white paper for Perforce's branch model, Inter-File Branching [PFIFB], presents a difficulty in data representations used by version control systems. In particular it stresses a difference in treatment of variant names and file names, pointing out that deltas are only merged between variants of single files and that only one variant is presented to the user at a time (such as in the filesystem). This results in two independent naming systems for the data stored in the repository. However, this rarely presents a problem in practice.

Sourcesafe

Sourcesafe's model maps projects and subprojects to directories. Every project corresponds to a directory, but not all directories are projects. Data files are stored in a SQL database. Directories do not have versions, but sufficient version metadata is recorded to allow reconstruction of prior source trees.

The combination of this model and its implementation is very restrictive: files may only be moved from a project to a parent project; directories can't be moved at all, and moving projects loses metadata.

The implementation imposes some arbitrary limits such as upper bounds of 8000 files in a project and 15 levels of nested subprojects. While many projects will not exceed these limits, large projects will be inhibited in their growth, and reorganization will be difficult.

Further, storing data files in a SQL database limits the ability of developers and administrators to fix problems in the repository should they arise, and special tools are needed to extract the data from the database, for example for backups.

Subversion

Subversion presents a model in which the repository has versions rather than individual files. This allows easy reconstruction of any previous source tree. Subversion's repository uses a Berkeley DataBase system for storing data files.

This model is counterintuitive because when any file changes, the version of the whole repository, and hence the latest version of every file, is incremented. For example, rather than having version 4 of foo.c, one can have foo.c as it appeared in version 4 of the repository. But files are usually developed as independent entities, and developers thus think of different versions of their files rather than different versions of the whole repository.

Vesta

Vesta is a complete software configuration management system. Access to files is provided through NFS, and a scripting language is provided for defining how software is built including dependencies, compilers, etc. Vesta uses reserved checkouts - the next version is locked until the developer who checked it out releases it or commits changes. Vesta stores immutable snapshots of the working directory, and committing changes is committing the latest snapshot, at which point the reservation is removed and the new versions are available for other developers to use. Directories are not versioned, but prior source trees can be reconstructed.

DVS Structure

The DVS repository is a directed acyclic graph (DAG) in which directory nodes have children and metadata but no data of their own, while file nodes have data but no children or metadata. A subdirectory appears as a data file in its parent directory and has a version like other data files, but it also has its own set of data files that it contains.

Directories as a Graph

A collection of files and directories in DVS is represented as a DAG rather than the traditional, more restrictive tree model. Each file and directory is represented as a node in the graph, and the graph has an edge from node d to node f, if directory d contains f in the filesystem tree.

While the file system where checked out files are placed will usually still impose use of a tree for the directory structure, a subdirectory in a project may appear in arbitrarily many places provided that no cycles are introduced into the graph. This does not replace branches, which are still required for concurrent development.

Relaxing the model from a tree to a DAG allows for subsystems to conveniently and efficiently depend on one another, as any directory appearing in the project in multiple places is stored only once in the repository. Additionally, this facilitates dependencies on different versions of a subsystem without sacrificing consistency. For example, two different projects might use different versions of the same library, which can be accomplished by making the library a subdirectory of both.
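
The "no cycles" condition can be checked before a directory is linked into a new place. The following is a minimal sketch, not the DVS implementation: the class DirectoryGraph and its methods are hypothetical names, and nodes are identified by plain strings standing in for UIDs.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory model of the directory graph (illustration only). */
class DirectoryGraph {
    // Maps each directory to the nodes it directly contains.
    private final Map<String, Set<String>> children = new HashMap<>();

    /** Returns true if target can be reached from start along containment edges. */
    boolean reaches(String start, String target) {
        Deque<String> pending = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        pending.push(start);
        while (!pending.isEmpty()) {
            String node = pending.pop();
            if (node.equals(target)) {
                return true;
            }
            if (seen.add(node)) {
                for (String child : children.getOrDefault(node, Collections.<String>emptySet())) {
                    pending.push(child);
                }
            }
        }
        return false;
    }

    /** Adds the edge parent -> child unless doing so would introduce a cycle. */
    boolean link(String parent, String child) {
        if (reaches(child, parent)) {
            return false; // child already contains parent, directly or transitively
        }
        children.computeIfAbsent(parent, k -> new HashSet<>()).add(child);
        return true;
    }
}
```

With this model, two projects can both link the same library directory, while an attempt to link a project underneath one of its own descendants is refused.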

Contents Files

The following grammar describes the syntax of DVS contents files:

         contents              ::= entry_list
         entry_list            ::= <empty> | entry entry_list
         entry                 ::= LOCAL_NAME UID TYPE VERSION <newline>

LOCAL_NAME is the name used in the working directory. This name is not required to be unique. For example, two files may each be named foo.c if they're in different directories. For each of these files, foo.c is the local name.

UID is a file's unique identifier in the repository. It is never removed and never reused. Metadata is tracked for each UID regardless of how many places the file appears in the source tree that DVS constructs. See below for a more detailed discussion of this field and its use.

TYPE is one of "file," "dir," or "add." The first two indicate the file type in the obvious sense. "Add" indicates to DVS that the file is a recent addition and thus not yet in the repository. All add operations in DVS are local until committed, at which time all additions are processed in a batch. At commit time, the type of such a contents entry is changed to either "file" or "dir." Contents files in the repository use only these two types; the "add" type appears only in the local version of a directory before commit.

VERSION is the version number of the file. These version numbers are chosen in the same way that RCS version numbers are chosen [RCS].

Example Contents File:

           file1.c src/subsys1/test.c    file 1.3
           file2.c src/subsys2/util.c    file 1.8
           file3.c src/subsys2/util0.c   file 1.2
           libfoo  src/libs/foo          dir  1.2
           newfile src/subsys1/newtest.c add  1.1

This example shows several properties of contents files:

All entries are sorted alphabetically by the LOCAL_NAME field.
Since DVS allows files to be removed, resurrected, and moved, name conflicts are possible. The name of a file in the repository is used as the unique name for a file. See "UIDs" for a detailed treatment of these names.
The file "newfile" is a recent addition - this file is not yet present in the repository, but it will be added when "newfile" is committed or the directory is committed.
The files in the directory represented by this contents file are not all in the same directory in the repository. DVS grants the developer this degree of freedom by abstracting the working directory from the repository's arrangement of files.
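
To make the format concrete, here is a minimal sketch of a parser for contents files. It assumes that fields are separated by runs of whitespace and that no field contains embedded whitespace; neither assumption is stated by the grammar. The class name ContentsLine is hypothetical and is not the hidden ContentsEntry class of the implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** One line of a contents file: LOCAL_NAME UID TYPE VERSION (illustrative sketch). */
class ContentsLine {
    final String localName;
    final String uid;
    final String type;     // "file", "dir", or "add"
    final String version;  // RCS-style revision number such as "1.3"

    ContentsLine(String localName, String uid, String type, String version) {
        this.localName = localName;
        this.uid = uid;
        this.type = type;
        this.version = version;
    }

    /** Parses a single entry; assumes whitespace-separated fields. */
    static ContentsLine parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 4) {
            throw new IllegalArgumentException("expected 4 fields: " + line);
        }
        String type = fields[2];
        if (!type.equals("file") && !type.equals("dir") && !type.equals("add")) {
            throw new IllegalArgumentException("unknown type: " + type);
        }
        return new ContentsLine(fields[0], fields[1], type, fields[3]);
    }

    /** Parses a whole contents file; skipping blank lines is an assumption. */
    static List<ContentsLine> parseAll(String text) {
        List<ContentsLine> entries = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                entries.add(parse(line));
            }
        }
        return entries;
    }
}
```

Applied to the example above, ContentsLine.parseAll yields five entries, the fourth of which has local name "libfoo", UID "src/libs/foo", type "dir", and version "1.2".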

UIDs

As mentioned above, UIDs record a file's true identity in the repository. These identifiers must be unique, of course, and once a file is added to DVS with a particular UID, that UID is never removed or reused, and the corresponding file is never deleted from the repository. This presents a layer of indirection in that the file names DVS shows to the user need not be globally unique, and after a file is deleted, a new and unrelated one may be created with the same name without creating a name conflict with the previous file.

Because DVS is implemented as a layer on top of CVS, the expected UID is a relative path in the CVS sandbox. Using a back end other than CVS may suggest or require a different scheme for UIDs.

Choosing names

In order to ensure that names are unique but still comprehensible, in case a repository administrator wishes to work directly with the repository, the first choice for a UID is the LOCAL_NAME prefixed by the path from the top of the project tree in the filesystem. In the absence of any name conflicts, this is clearly the best choice and makes the repository easy to maintain. However, given that files may be deleted, moved, renamed, etc., it is expected that name conflicts will arise. In such a case, a natural number is appended to the name to make it unique. These numbers are chosen sequentially beginning with 1, and the lowest unused number is chosen. Because files are not removed from the repository, a used number will never become available.
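
The naming rule can be sketched as follows. This is an illustration under assumptions: the class UidChooser is a hypothetical name, the set of used UIDs stands in for a query against the repository, and the number is appended directly to the name with no separator, a detail the text does not specify.

```java
import java.util.Set;

/** Illustrative sketch of UID selection; not the DVS implementation. */
class UidChooser {
    /**
     * Picks a UID for a new file and records it as used. The set usedUids
     * holds every UID ever assigned; since UIDs are never removed or reused,
     * it only grows.
     */
    static String choose(Set<String> usedUids, String projectPath, String localName) {
        // First choice: the local name prefixed by its path within the project.
        String base = projectPath.isEmpty() ? localName : projectPath + "/" + localName;
        String candidate = base;
        // On conflict, try 1, 2, 3, ... and keep the lowest unused number.
        // Assumed suffix format: the number is appended with no separator.
        for (int n = 1; usedUids.contains(candidate); n++) {
            candidate = base + n;
        }
        usedUids.add(candidate);
        return candidate;
    }
}
```

For example, adding test.c under src/subsys1 three times over the life of a project (with removals in between) would yield src/subsys1/test.c, then src/subsys1/test.c1, then src/subsys1/test.c2.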

A special value ("_") is used for files that are scheduled for addition to the repository (those that have type "add"). This is necessary because in the general case choosing a valid name is not possible without contacting the (possibly remote) repository, so the correct UID is not known.

The indirection afforded through this mechanism facilitates the preservation of metadata across operations such as remove and restore because metadata is attached to the UID rather than the LOCAL_NAME for each file DVS tracks, allowing the LOCAL_NAMEs seen by the user to change arbitrarily as suits the project at hand.

Operations on Contents Files

Overview of Behaviors

initialize	Generate contents files for any directories already present, perform "add" operation for each directory or normal file already present. This is used for creating a new project, and once the repository has been created, initialize is not used.
checkout <obj>	Read contents file to find UID for <obj>, then fetch <obj> from the repository. This is a top-down recursive procedure. If <obj> is a normal file, no recursion takes place.
remove <obj>	Removes <obj> from its parent directory's contents file. This is never a recursive procedure.
update <obj>	The latest version of the specified object is fetched from the repository for comparison with the current instance. If the object doesn't exist locally, it is created. Otherwise, DVS notes whether it is modified or unmodified. If the specified object is a directory, this is a top-down recursive procedure beginning with the directory's contents file. After updating the contents file, each node in the DAG listed in the contents file is updated.
add <obj>	Adds an entry to the contents file for the containing directory. Note that this entry will have a special type and a special UID to mark it as a new addition, and both of these fields will be modified when the containing directory is committed.
link <src> <dest>	Creates a contents entry for <dest> having the same UID and version as <src>. No new file is created and <src> must already exist.
commit <obj>	Compares <obj> to the latest version of <obj> in the repository. If there are conflicts (which can only happen in concurrent development where <obj> has been changed and committed to the repository since the local copy was checked out), the conflict is noted and must be resolved before the local changes are reflected in the repository.
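
The purely local operations (add, remove, and link) amount to small edits of a contents file. The sketch below models a contents file as a map sorted by LOCAL_NAME. The names ContentsSketch and Entry are hypothetical, and the initial version "1.1" for added entries is an assumption taken from the example contents file above.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory contents file, keyed and sorted by LOCAL_NAME. */
class ContentsSketch {
    /** The fields of an entry other than its local name. Immutable. */
    static final class Entry {
        final String uid;
        final String type;
        final String version;

        Entry(String uid, String type, String version) {
            this.uid = uid;
            this.type = type;
            this.version = version;
        }
    }

    final Map<String, Entry> entries = new TreeMap<>();

    private void requireAbsent(String localName) {
        if (entries.containsKey(localName)) {
            throw new IllegalArgumentException("name already in use: " + localName);
        }
    }

    /** add: the entry gets type "add" and the placeholder UID "_" until commit. */
    void add(String localName) {
        requireAbsent(localName);
        entries.put(localName, new Entry("_", "add", "1.1")); // "1.1" is assumed
    }

    /** remove: drops the entry only; the file behind the UID stays in the repository. */
    Entry remove(String localName) {
        Entry removed = entries.remove(localName);
        if (removed == null) {
            throw new IllegalArgumentException("not tracked: " + localName);
        }
        return removed;
    }

    /** link: the new entry shares the UID, TYPE and VERSION of the source entry. */
    static void link(ContentsSketch srcDir, String srcName,
                     ContentsSketch destDir, String destName) {
        Entry src = srcDir.entries.get(srcName);
        if (src == null) {
            throw new IllegalArgumentException("not tracked: " + srcName);
        }
        destDir.requireAbsent(destName);
        destDir.entries.put(destName, src); // no new file is created
    }
}
```

None of these operations touches the repository; under this model, the changes become visible to other developers only when the containing directory is committed.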

Examples

Sample repository:

b
|__d1.2 . . . . . . d1.3
|  |___e1.1 . . . . |__e1.2
|      |___f1 1.1 . . . |___f1 1.1
|      |___f2 1.1 . . . |___f2 1.2
|      |___f3 1.1 . . . |___f3 1.3
|__c1.7
|  |___m1.1
|  |___n1.3
| ...

Sample Source Tree:

b
|__d1.2
|  |__e1.1
|     |___f1 1.1
|     |___f2 1.1
|     |___f3 1.1
| ...

1.3 is the latest committed version of directory d.

In the sample repository, the left tree in the repository represents the contents files, while the right tree represents latest versions, and the "sample source tree" represents the files checked out. Thus an operation such as "update" from within directory b would fetch d1.2, e1.1, and so on down the tree based on the contents files. But "update d" would fetch d1.3 and then follow its contents file.

In the source tree, the files f1, f2, and f3 may be different from the versions in the repository. Assume that they were checked out as the versions specified for them, but do not assume that they have been changed or that they have not.
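
The difference between following the checked-out contents files and starting from the latest version of d can be illustrated with a small top-down walk. This is a sketch only: the class SnapshotWalk, its map of contents files keyed by directory and version, and the use of local names as UIDs are all assumptions made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative top-down walk over versioned contents files. */
class SnapshotWalk {
    static final class Entry {
        final String localName;
        final String uid;
        final String type;
        final String version;

        Entry(String localName, String uid, String type, String version) {
            this.localName = localName;
            this.uid = uid;
            this.type = type;
            this.version = version;
        }
    }

    // Contents files keyed by "<directory UID>@<version>"; a stand-in for the repository.
    private final Map<String, List<Entry>> contentsByDirVersion = new HashMap<>();

    void putContents(String dirUid, String version, Entry... entries) {
        contentsByDirVersion.put(dirUid + "@" + version, Arrays.asList(entries));
    }

    /** Lists "path version" for everything reachable from the given directory version. */
    List<String> resolve(String path, String dirUid, String version) {
        List<String> out = new ArrayList<>();
        out.add(path + " " + version);
        List<Entry> contents = contentsByDirVersion.getOrDefault(
                dirUid + "@" + version, Collections.<Entry>emptyList());
        for (Entry e : contents) {
            String childPath = path + "/" + e.localName;
            if (e.type.equals("dir")) {
                out.addAll(resolve(childPath, e.uid, e.version)); // recurse into subdirectory
            } else {
                out.add(childPath + " " + e.version);
            }
        }
        return out;
    }
}
```

Loaded with the sample repository, resolving d at version 1.2 selects e 1.1 and the 1.1 versions of f1, f2, and f3, while resolving d at version 1.3 selects e 1.2 and with it f2 1.2 and f3 1.3.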

Examples of how contents files change

Let NCC denote checking for conflicts and notifying the user if any are found.

Suppose we're in directory e from the above examples.

Operation	Actions
update d	NCC; d1.3 brought to sandbox
update f2	NCC; f2 1.2 brought to sandbox
update	NCC; d1.2, e1.1, ... used for update

checkout e	e1.1 checked out
checkout e 1.2	e1.2 checked out

Note that none of these operations makes changes to any contents file.

move f1 f5 (requires modifying e1.1 contents)
	f1 /path/to/f1 file 1.1         f2 /path/to/f2 file 1.1
	f2 /path/to/f2 file 1.1   ==>   f3 /path/to/f3 file 1.1
	f3 /path/to/f3 file 1.1         f5 /path/to/f1 file 1.1

modify f1 and commit:
	f1 /path/to/f1 file 1.1         f1 /path/to/f1 file 1.2
	f2 /path/to/f2 file 1.1   ==>   f2 /path/to/f2 file 1.1
	f3 /path/to/f3 file 1.1         f3 /path/to/f3 file 1.1

remove f1:
	f1 /path/to/f1 file 1.2         f1 /path/to/f1 removed 1.2
	f2 /path/to/f2 file 1.1   ==>   f2 /path/to/f2 file 1.1
	f3 /path/to/f3 file 1.1         f3 /path/to/f3 file 1.1

link ../c/m .
	f1 /path/to/f1 file 1.2         c: no change
	f2 /path/to/f2 file 1.1   ==>   e: f1 /path/to/f1 file 1.2
	f3 /path/to/f3 file 1.1            f2 /path/to/f2 file 1.1
	                                   f3 /path/to/f3 file 1.1
	                                   m /path/to/m file 1.1

Suppose we wish to commit new versions to e. This is an example of how a contents file changes when a directory is committed:

d's contents:
	e /path/to/e dir 1.1 ...	==>	e /path/to/e dir 1.2 ...

DVS Implementation

Repository API

Structures used:

	Contents	Represents a single contents file
	ContentsEntry	Represents one entry in a contents file
	Repository	Represents an entire DVS source tree

	Only the Repository API is exposed. The Contents and ContentsEntry instances are hidden and they are mutually recursive.

Public Members: none
Public Methods:

void init (File rdir, File ddir)

Initializes a directory as a new DVS source tree. Argument rdir is the location in the file system where files will be stored when checked out of the (possibly remote) repository. Files are created under this directory using their UIDs, and no file occurs twice in the file system under rdir. Argument ddir is the directory in the file system where DVS will create the files that appear in a project. Directories are created per contents file and data files are created by local name under this directory.

DVS is currently implemented as a layer on top of CVS. As such, rdir is a CVS sandbox and every file in the CVS repository corresponds to precisely one UID listed in the contents files DVS keeps. One such file may occur multiple times under ddir depending on the DVS operations performed. CVS's metadata directories, called "CVS" and stored in each directory in a CVS module, are never recreated by DVS; they exist only in the CVS sandbox.

void addFile (File file)

AddFile adds a file to the repository. The specified file must be present in the source tree. An entry is created in the appropriate contents file (having type "add").

void linkFile (File src, File dest)

LinkFile links src, which must already be tracked by DVS, into the new location dest. This consists of duplicating the contents entry for src into the contents file for dest. If src and dest specify different file names, the name for dest is used in the entry added to dest's contents file. The UID, TYPE and VERSION fields of src's contents entry are always used in the new entry for dest.

void removeFile (File file)

RemoveFile removes a file from DVS' revision tracking. The file must be present in the source tree. Its contents entry is removed and future checkout operations that don't explicitly name this file will not retrieve it. If the file noted by the UID is present in other locations, those locations are unaffected.

void updateFile (File file)

UpdateFile updates the specified file from the DVS repository. This entails fetching the file for comparison. If no changes have been made in the repository, the process stops. Otherwise, if no changes have been made locally, the local version is modified to reflect all changes from the version in the repository. If both have changed and CVS can merge them, it does; otherwise, markers are placed in the file showing where the two versions conflict.

Because DVS is implemented on top of CVS, DVS depends on CVS for most of this work. CVS' update operation is performed on the UID file, and changes in the repository are propagated to the local DVS version. Conflicts detected by CVS are noted by DVS.
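
The cases handled by updateFile can be summarized as a small decision function. This is a sketch of the logic described above, not code from DVS: the class UpdateDecision, the Outcome names, and the three boolean inputs are hypothetical, and in practice the merge itself is delegated to CVS.

```java
/** Illustrative summary of the outcomes of an update; names are hypothetical. */
class UpdateDecision {
    enum Outcome { UP_TO_DATE, TAKE_REPOSITORY_VERSION, MERGED, CONFLICT }

    static Outcome decide(boolean repositoryChanged, boolean localChanged, boolean mergeable) {
        if (!repositoryChanged) {
            return Outcome.UP_TO_DATE;              // nothing new in the repository
        }
        if (!localChanged) {
            return Outcome.TAKE_REPOSITORY_VERSION; // local copy simply catches up
        }
        // Both sides changed: either the changes merge cleanly or
        // conflict markers are left in the file.
        return mergeable ? Outcome.MERGED : Outcome.CONFLICT;
    }
}
```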

void tagFile (File file, String tag)

TagFile tags the specified file with the given symbolic tag. A CVS tag operation is performed on the file. It is an error to attempt to tag a file which has been modified from the version that was checked out.

Private Methods and Members

Repository

String fileToUID (File file)
FileToUID returns the UID corresponding to the specified file.
Contents
Each Contents object corresponds to a directory node in the DAG for the project.
ContentsEntry

CVS Back end

Command Structure

DVS uses a structure, class Command, which contains a command, a vector of arguments, and a directory. Execution of a command uses Java's exec method, passing all of this information and using a null environment (thus using the environment passed to DVS), since CVS often relies on environment variables. A simple threaded class is used to obtain output from the command during execution and relay it to DVS' output streams.

Command API

Public Members: none
Public Methods:

public int execute ()

Execute executes the command. This entails creation of two StreamGobbler objects, one for each output stream, executing the command, and waiting for termination. Execute's return value is the exit status of the command executed.

StreamGobbler is a class which simply reads all data from an output stream (its input stream) and passes it to its own output stream [STRGBL].

public String toString ()

ToString returns a string representation of this command. This is the concatenation of all of the arguments to the command, separated by space characters.

public void addArg (String arg)

AddArg appends the given string to this command's vector of arguments to be passed at execution time.

Private Methods:

private String[] toArray (): ToArray returns an array representation of this command. This is the format used by Java's exec() method - each element of the returned array is one token of the command to be executed. Exec() is not given a string representation (such as that returned by toString()) because Java would split the string at whitespace characters and the tokens are permitted to have embedded whitespace.
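
The Command API described above might look roughly like the following. This is a sketch under assumptions rather than the DVS source: the constructor signature is invented for the illustration, a List replaces the vector of arguments, execute declares checked exceptions that the documented signature does not show, and toString is assumed to include the command name ahead of its arguments.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

/** Copies everything from an input stream to an output stream on its own thread. */
class StreamGobbler extends Thread {
    private final InputStream in;
    private final OutputStream out;

    StreamGobbler(InputStream in, OutputStream out) {
        this.in = in;
        this.out = out;
    }

    @Override
    public void run() {
        byte[] buf = new byte[4096];
        try {
            for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                out.write(buf, 0, n);
            }
            out.flush();
        } catch (IOException e) {
            // The stream was closed early; there is nothing more to relay.
        }
    }
}

/** Illustrative sketch of the Command class. */
class Command {
    private final String command;
    private final List<String> args = new ArrayList<>();
    private final File directory;

    Command(String command, File directory) { // hypothetical constructor
        this.command = command;
        this.directory = directory;
    }

    public void addArg(String arg) {
        args.add(arg);
    }

    /** One array element per token, so a token may contain embedded whitespace. */
    private String[] toArray() {
        List<String> tokens = new ArrayList<>();
        tokens.add(command);
        tokens.addAll(args);
        return tokens.toArray(new String[0]);
    }

    public int execute() throws IOException, InterruptedException {
        // A null environment makes the child process inherit this JVM's environment.
        Process process = Runtime.getRuntime().exec(toArray(), null, directory);
        Thread stdout = new StreamGobbler(process.getInputStream(), System.out);
        Thread stderr = new StreamGobbler(process.getErrorStream(), System.err);
        stdout.start();
        stderr.start();
        int status = process.waitFor();
        stdout.join(); // make sure all output has been relayed before returning
        stderr.join();
        return status;
    }

    @Override
    public String toString() {
        return String.join(" ", toArray());
    }
}
```

A commit message such as "fix bug" added with addArg stays a single token in the array passed to exec, whereas the string form "cvs commit -m fix bug" would be split into five tokens.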

User Interface

In general, UI implementations will use the public API of a repository object. The DVS UI API presented here is used to acquire a repository object.

UI API

public Repository getRepository (File root): GetRepository returns a repository object representing the DVS source tree rooted at root. It is an error if there is no repository rooted there.

Command-based Interface

In the spirit of CVS, DVS' general command syntax is the following.

dvs [DVS arguments] <command> [command arguments]

Formats are specified below in "Formats for Arguments."

DVS Arguments

--help	Displays a summary of DVS use
--version	Displays version of DVS
--repository <where>	Specifies the location of the repository
--pending	Show pending operations deferred until next commit

DVS Commands and their Arguments

add			Schedules a file or directory for addition to the DVS repository
	dvs add <files>
		files	List of files to add

checkout			Retrieves files from the repository for viewing or editing
	dvs checkout [-r revision] [-D date] [-d dir] [-j rev] modules
		-r revision	Get the specified revision
		-D date	Get latest version committed as of the specified date [2]
		-d dir	Root working directory at given directory instead of module name
		modules	A list of modules to retrieve [3]

commit			Commit to the repository changes between the current versions of files and the versions checked out
	dvs commit [-F logfile] [-m message] [-r revision] [files]
		-F logfile	Read commit log message from the specified file
		-m message	Use the specified message in the commit log
		-r revision	Commit to the specified branch
		files	Which files' changes to commit. If none are specified, the current directory is used.

diff			Show differences between versions
	dvs diff [-r r1 r2] [-D date1 date2] [files]
		-r r1 r2	Compare the two specified versions
		-D date1 date2	Compare the latest versions committed as of the specified dates [2]
		files	Which files to compare. If none are specified, the current directory is used.

import			Import source tree into a DVS repository
	dvs import [directory]
		directory	Top directory to consider for import. If not specified, the current directory is used.

log			Display log information for files
	dvs log [-r Revisions] [files]
		-r Revisions	Display logs for the specified revision(s) [1]
		files	Files whose logs will be displayed. If none is specified, the current directory is used.

move			Renames a file or moves it to another directory
	dvs move <source> <dest>
		source	File to move (possibly including a path)
		dest	New name for source (possibly including a path)

	Note that the move procedure is only permitted on files which are synchronized with the DVS repository. This restriction is necessary because when a file is checked out, modified, then moved, it's not always clear which versions should be listed in the contents files after the move operation.

remove			Remove a file from DVS
	dvs remove <files>
		files	Which files to remove. The files are removed from the appropriate contents files but are not removed from the DVS repository.

tag			Add a symbolic tag to the currently checked-out files
	dvs tag [-b] [-r revision] tag [files]
		-b	Create a branch and apply the specified tag to all files in the branch
		-r revision	Tag the specified version
		tag	The tag to apply
		files	Which files to tag. If none are specified, the current directory is used.

update			Synchronize current files with those in the repository.
	dvs update [-r revision] [files]
		-r revision	Use the specified version in the repository as source for update operation
		files	Which files to update

Formats for Arguments

[1] Versions

v	only v
v1:v2	inclusive range
v1:	v1 and later
:v2	not later than v2
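
A minimal sketch of how these four forms might be parsed and applied follows. The class RevisionRange is a hypothetical name, and the comparison shown handles only simple dotted revision numbers compared component by component; how DVS orders branch revisions is not specified here.

```java
/** Illustrative parser for the revision formats v, v1:v2, v1: and :v2. */
class RevisionRange {
    final String low;   // null means unbounded below
    final String high;  // null means unbounded above

    RevisionRange(String low, String high) {
        this.low = low;
        this.high = high;
    }

    static RevisionRange parse(String spec) {
        int colon = spec.indexOf(':');
        if (colon < 0) {
            return new RevisionRange(spec, spec); // "v": only v
        }
        String lo = spec.substring(0, colon);
        String hi = spec.substring(colon + 1);
        return new RevisionRange(lo.isEmpty() ? null : lo, hi.isEmpty() ? null : hi);
    }

    /** Compares dotted revision numbers numerically, so 1.9 sorts before 1.10. */
    static int compare(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.min(x.length, y.length); i++) {
            int c = Integer.compare(Integer.parseInt(x[i]), Integer.parseInt(y[i]));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(x.length, y.length);
    }

    /** Both bounds are inclusive. */
    boolean contains(String version) {
        return (low == null || compare(low, version) <= 0)
                && (high == null || compare(version, high) <= 0);
    }
}
```

For instance, the range 1.2:1.10 contains 1.9 but not 1.11, because components are compared as numbers rather than as text.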

[2] Dates

Dates are specified in the format YYYYMMDD
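
In Java, this format corresponds to the standard formatter DateTimeFormatter.BASIC_ISO_DATE. The sketch below shows one way a date argument could be validated; the class DateArgument is a hypothetical name, and note that this formatter also tolerates an optional trailing UTC offset, which DVS may or may not accept.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Illustrative parser for date arguments in YYYYMMDD form. */
class DateArgument {
    /** Throws java.time.format.DateTimeParseException if the text is malformed. */
    static LocalDate parse(String arg) {
        return LocalDate.parse(arg, DateTimeFormatter.BASIC_ISO_DATE);
    }
}
```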

[3] Modules

A module may have a suffix specifying a subdirectory within the module to checkout. In such a case, only that part of the module is checked out from the repository. For example, one could check out subdirectory sub of module mod with dvs checkout mod/sub.

References

[CLEARCASE]	Rational ClearCase http://www.rational.com/products/clearcase/index.jsp
[CVS]	CVS http://www.cvshome.org
[PERFORCE]	Perforce http://www.perforce.com
[PFIFB]	Perforce Inter-File Branching white paper http://www.perforce.com/perforce/barnch.html
[RCS]	RCS manual page rcsfile(5)
[SOURCESAFE]	Visual SourceSafe http://msdn.microsoft.com/ssafe
[STRGBL]	StreamGobbler http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-traps.html
[SUBVERSION]	Subversion http://subversion.tigris.org
[VESTA]	Vesta Configuration Management System http://www.vestasys.org