Strategies for Repository Deployment

Due largely to the simplicity of the overall design of the Subversion repository and the technologies on which it relies, creating and configuring a repository are fairly straightforward tasks. There are a few preliminary decisions you'll want to make, but the actual work involved in any given setup of a Subversion repository is pretty basic, tending towards mindless repetition if you find yourself setting up multiples of these things.

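There are a few things you'll want to consider beforehand: how your data will be organized within the repository, where the repository will live and how it will be accessed, and which type of data store to use.

In this section, we'll try to help you answer those questions.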

Planning Your Repository Organization

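Subversion lets you move versioned files and directories around without any loss of information, and even provides ways of moving whole sets of versioned history from one repository to another. Doing so, however, can disrupt the workflow of those who access the repository often and have come to expect things to be at certain locations. So before creating a new repository, try to peer into the future a bit and plan your repository layout ahead of time. Starting a project with a well-considered layout prevents many future headaches.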

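Let's assume that as repository administrator, you will be responsible for supporting the version control system for several projects. Your first decision is whether to use a single repository for multiple projects, give each project its own repository, or settle on some compromise of the two.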

There are benefits to using a single repository for multiple projects, most obviously the lack of duplicated maintenance. A single repository means that there is one set of hook programs, one thing to routinely back up, one thing to dump and load if Subversion releases an incompatible new version, and so on. Also, you can move data between projects easily, without losing any historical versioning information.

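The downside of using a single repository is that different projects may have different requirements for repository event triggers, such as needing to send commit notification emails to different mailing lists, or having different definitions of what does and does not constitute a legitimate commit. These aren't insurmountable problems, of course; it just means that all of your hook programs have to be sensitive to the layout of your repository rather than assuming that the whole repository is associated with a single group of people. Also, remember that Subversion uses repository-global revision numbers. While those numbers don't have any particular magical powers, some folks still don't like the fact that even though no changes have been made to their project lately, the youngest revision number for the repository keeps climbing because other projects are actively adding new revisions. [25]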
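To make the point about layout-aware hook programs concrete, here is a minimal sketch of the routing decision such a hook might make. It is an illustration only, not an actual Subversion hook: the mailing-list addresses, the `PROJECT_LISTS` mapping, and the function name are hypothetical, and it assumes a layout in which each project has its own top-level directory. A real post-commit hook would obtain the changed paths from the repository (for example with `svnlook changed`) rather than from a hard-coded list.

```python
# Hypothetical mapping from top-level project directory to mailing list.
PROJECT_LISTS = {
    "calc": "calc-commits@example.com",
    "calendar": "calendar-commits@example.com",
}
DEFAULT_LIST = "repo-admins@example.com"  # assumed fallback address


def lists_for_commit(changed_paths):
    """Return the sorted mailing lists to notify for one commit."""
    recipients = set()
    for path in changed_paths:
        # The first path component names the project in a layout such as
        # /calc/trunk/... or /calendar/branches/...
        project = path.strip("/").split("/", 1)[0]
        recipients.add(PROJECT_LISTS.get(project, DEFAULT_LIST))
    return sorted(recipients)


print(lists_for_commit(["calc/trunk/main.c", "calendar/tags/1.0/README"]))
```

A commit that touches several projects is reported to each affected list, and paths outside any known project fall back to the administrators' address.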

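A middle-ground approach can be taken, too. For example, projects can be grouped by how closely they relate to each other, with one repository for each group. That way, projects that are likely to want to share data can still do so easily, and as new revisions are added to the repository, the developers at least know that those revisions are remotely related to everyone who uses that repository.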

After deciding how to organize your projects with respect to repositories, you'll probably want to think about directory hierarchies within the repositories themselves. Because Subversion uses regular directory copies for branching and tagging (see Chapter 4, Branching and Merging), the Subversion community recommends that you choose a repository location for each project root (the “top-most” directory that contains data related to that project) and then create three subdirectories beneath that root: trunk, meaning the directory under which the main project development occurs; branches, which is a directory in which to create various named branches of the main development line; and tags, which is a collection of tree snapshots that are created, and perhaps destroyed, but never changed. [26]

For example, your repository might look like this:

/
   calc/
      trunk/
      tags/
      branches/
   calendar/
      trunk/
      tags/
      branches/
   spreadsheet/
      trunk/
      tags/
      branches/
   …
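A layout like the one above is regular enough to generate. The following sketch lists the directories for a one-root-per-project layout and builds, without running it, an `svn mkdir` command that would create them in a single commit. The function names, the log message, and the repository URL are hypothetical, and the command assumes a Subversion client recent enough to support the `--parents` option.

```python
def project_skeleton(projects, subdirs=("trunk", "tags", "branches")):
    """List the directories for a one-root-per-project layout."""
    return [f"{project}/{sub}" for project in projects for sub in subdirs]


def mkdir_command(repo_url, projects):
    """Build (but do not run) an 'svn mkdir' command for the skeleton."""
    base = repo_url.rstrip("/")
    urls = [f"{base}/{path}" for path in project_skeleton(projects)]
    return ["svn", "mkdir", "--parents",
            "-m", "Create initial project layout"] + urls


for path in project_skeleton(["calc", "calendar", "spreadsheet"]):
    print(path)
```

Passing the result of `mkdir_command("file:///var/svn/repos", [...])` to `subprocess.run` would create the whole skeleton in one revision, which keeps the early history tidy.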

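Note that it doesn't matter where in your repository each project root is. If you have only one project per repository, the logical place to put the project root is at the root of that repository. If you have multiple projects, you might want to arrange them in groups inside the repository, perhaps putting projects with similar goals or shared code in the same subdirectory, or maybe just grouping them alphabetically. Such an arrangement might look like this: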

/
   utils/
      calc/
         trunk/
         tags/
         branches/
      calendar/
         trunk/
         tags/
         branches/
      …
   office/
      spreadsheet/
         trunk/
         tags/
         branches/
      …

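Lay out your repository in whatever way you see fit. Subversion does not expect or enforce a particular layout; in its eyes, a directory is a directory. Finally, when designing your repository layout, don't forget to take the opinions of the people who work on the projects into account.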

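For the sake of completeness, though, we'll mention another very common layout. In this layout, the trunk, tags, and branches directories live in the root directory of your repository, and your projects are in subdirectories beneath those, like so: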

/
   trunk/
      calc/
      calendar/
      spreadsheet/
      …
   tags/
      calc/
      calendar/
      spreadsheet/
      …
   branches/
      calc/
      calendar/
      spreadsheet/
      …

There's nothing particularly incorrect about such a layout, but it may or may not seem as intuitive for your users. Especially in large, multiproject situations with many users, those users may tend to be familiar with only one or two of the projects in the repository. But the projects-as-branch-siblings approach tends to de-emphasize project individuality and focus on the entire set of projects as a single entity. That's a social issue, though. We like our originally suggested arrangement for purely practical reasons: it's easier to ask about (or modify, or migrate elsewhere) the entire history of a single project when there's a single repository path that holds the entire history (past, present, tagged, and branched) for that project and that project alone.
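The practical difference between the two arrangements can be stated in a few lines. This sketch returns the repository paths that together hold one project's history under each layout; the layout names `"project-roots"` and `"branch-siblings"` and the function itself are labels invented for this illustration.

```python
SUBDIRS = ("trunk", "tags", "branches")


def history_paths(project, layout):
    """Return the repository paths that together hold one project's history."""
    if layout == "project-roots":
        # trunk, tags, and branches all live under the project root.
        return [f"/{project}"]
    if layout == "branch-siblings":
        # The project appears once under each top-level directory.
        return [f"/{sub}/{project}" for sub in SUBDIRS]
    raise ValueError(f"unknown layout: {layout}")


print(history_paths("calc", "project-roots"))
print(history_paths("calc", "branch-siblings"))
```

With project roots, a log query, an access rule, or a migration needs to name only one path; with branch siblings, the same task has to name three.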

Deciding Where and How to Host Your Repository

Before creating your Subversion repository, an obvious question you'll need to answer is where the thing is going to live. This is strongly connected to a myriad of other questions involving how the repository will be accessed (via a Subversion server or directly), by whom (users behind your corporate firewall or the whole world out on the open Internet), what other services you'll be providing around Subversion (repository browsing interfaces, email-based commit notification, etc.), your data backup strategy, and so on.

We cover server choice and configuration in Chapter 6, Server Configuration, but the point we'd like to briefly make here is simply that the answers to some of these other questions might have implications that force your hand when deciding where your repository will live. For example, certain deployment scenarios might require accessing the repository via a remote filesystem from multiple computers, in which case (as you'll read in the next section) your choice of a repository backend data store turns out not to be a choice at all because only one of the available backends will work in this scenario.

Addressing each possible way to deploy Subversion is both impossible and outside the scope of this book. We simply encourage you to evaluate your options using these pages and other sources as your reference material and to plan ahead.

Choosing a Data Store

As of version 1.1, Subversion provides two options for the type of underlying data store—often referred to as “the backend” or, somewhat confusingly, “the (versioned) filesystem”—that each repository uses. One type of data store keeps everything in a Berkeley DB (or BDB) database environment; repositories that use this type are often referred to as being “BDB-backed.” The other type stores data in ordinary flat files, using a custom format. Subversion developers have adopted the habit of referring to this latter data storage mechanism as FSFS [27] —a versioned filesystem implementation that uses the native OS filesystem directly—rather than via a database library or some other abstraction layer—to store data.

Table 5.1 gives a comparative overview of Berkeley DB and FSFS repositories.

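Table 5.1.

| Category | Feature | Berkeley DB | FSFS |
|---|---|---|---|
| Reliability | Data integrity | When properly deployed, extremely reliable; Berkeley DB 4.4 brings auto-recovery. | Older versions had some rarely demonstrated, but data-destroying bugs. |
| Reliability | Sensitivity to interruptions | Very; crashes and permission problems can leave the database “wedged,” requiring journaled recovery procedures. | Quite insensitive. |
| Accessibility | Usable from a read-only mount | No. | Yes. |
| Accessibility | Platform-independent storage | No. | Yes. |
| Accessibility | Usable over network filesystems | Generally, no. | Yes. |
| Accessibility | Group permissions handling | Sensitive to user umask problems; best if accessed by only one user. | Works around umask problems. |
| Scalability | Repository disk usage | Larger (especially if logfiles aren't purged). | Smaller. |
| Scalability | Number of revision trees | Database; no problems. | Some older native filesystems don't scale well with thousands of entries in a single directory. |
| Scalability | Directories with many files | Slower. | Faster. |
| Performance | Checking out latest revision | No meaningful difference. | No meaningful difference. |
| Performance | Large commits | Slower overall, but cost is amortized across the lifetime of the commit. | Faster overall, but finalization delay may cause client timeouts. |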

There are advantages and disadvantages to each of these two backend types. Neither of them is more “official” than the other, though the newer FSFS is the default data store as of Subversion 1.2. Both are reliable enough to trust with your versioned data. But as you can see in Table 5.1, the FSFS backend provides quite a bit more flexibility in terms of its supported deployment scenarios. More flexibility means you have to work a little harder to find ways to deploy it incorrectly. Those reasons, plus the fact that not using Berkeley DB means there's one fewer component in the system, largely explain why today almost everyone uses the FSFS backend when creating new repositories.
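The accessibility rows of Table 5.1 can be read as a simple rule for which backends remain viable in a given deployment. The sketch below encodes only the two hard constraints from the table (network shares and read-only mounts) and then builds, without running it, the matching `svnadmin create` command. The function names and the repository path are hypothetical; `--fs-type` is the `svnadmin create` option that selects the data store.

```python
def viable_backends(on_network_share=False, read_only_mount=False):
    """Return the backends Table 5.1 allows for this deployment."""
    backends = ["fsfs", "bdb"]
    if on_network_share or read_only_mount:
        # Berkeley DB is generally unusable over network filesystems
        # and cannot be used from a read-only mount.
        backends.remove("bdb")
    return backends


def create_command(repo_path, fs_type):
    """Build (but do not run) the command that creates the repository."""
    return ["svnadmin", "create", "--fs-type", fs_type, repo_path]


choice = viable_backends(on_network_share=True)[0]
print(" ".join(create_command("/var/svn/repos", choice)))
```

Softer considerations from the table, such as umask handling or disk usage, are deliberately left out; they inform the choice without ruling a backend out.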

Fortunately, most programs that access Subversion repositories are blissfully ignorant of which backend data store is in use. And you aren't even necessarily stuck with your first choice of a data store—in the event that you change your mind later, Subversion provides ways of migrating your repository's data into another repository that uses a different backend data store. We talk more about that later in this chapter.
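As a rough preview of what such a migration involves, the sketch below creates a new repository with the desired backend and pipes a dump of the old repository into it. It assumes the `svnadmin` program is on the search path; the function names and the example paths are hypothetical, and a production script would also need to carry over hook programs and configuration, which a dump does not include.

```python
import subprocess


def migration_commands(old_repo, new_repo, fs_type="fsfs"):
    """Return the create, dump, and load commands for a migration."""
    return [
        ["svnadmin", "create", "--fs-type", fs_type, new_repo],
        ["svnadmin", "dump", "--quiet", old_repo],
        ["svnadmin", "load", "--quiet", new_repo],
    ]


def migrate(old_repo, new_repo, fs_type="fsfs"):
    """Create new_repo and stream old_repo's history into it."""
    create, dump, load = migration_commands(old_repo, new_repo, fs_type)
    subprocess.run(create, check=True)
    # Equivalent to the shell pipeline: svnadmin dump ... | svnadmin load ...
    dumper = subprocess.Popen(dump, stdout=subprocess.PIPE)
    loader = subprocess.run(load, stdin=dumper.stdout)
    dumper.stdout.close()
    if dumper.wait() != 0 or loader.returncode != 0:
        raise RuntimeError("dump/load migration failed")


# Show the commands for two hypothetical repository paths without running them.
for command in migration_commands("/var/svn/old-bdb", "/var/svn/new-fsfs"):
    print(" ".join(command))
```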

The following subsections provide a more detailed look at the available backend data store types.

Berkeley DB

When the initial design phase of Subversion was in progress, the developers decided to use Berkeley DB for a variety of reasons, including its open source license, transaction support, reliability, performance, API simplicity, thread-safety, support for cursors, and so on.

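Berkeley DB provides real transaction support, perhaps its most powerful feature. Multiple processes accessing your Subversion repositories don't have to worry about accidentally clobbering each other's data. The isolation provided by the transaction system is such that for any given operation, the Subversion repository code sees a static view of the database (not a database that is constantly changing at the hand of some other process) and can make decisions based on that view. If the decision made happens to conflict with what another process is doing, the entire operation is rolled back as though it never happened, and Subversion gracefully retries the operation against a new, updated (and yet still static) view of the database.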

Another great feature of Berkeley DB is hot backups: the ability to back up the database environment without taking it “offline.” We'll discuss how to back up your repository later in this chapter (in the section called “Repository Backup”), but the benefits of being able to make fully functional copies of your repositories without any downtime should be obvious.

Berkeley DB is also a very reliable database system when properly used. Subversion uses Berkeley DB's logging facilities, which means that the database first writes to on-disk logfiles a description of any modifications it is about to make, and then makes the modification itself. This is to ensure that if anything goes wrong, the database system can back up to a previous checkpoint (a location in the logfiles known not to be corrupt) and replay transactions until the data is restored to a usable state. See the section called “Managing Disk Space” later in this chapter for more about Berkeley DB logfiles.
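The log-then-modify idea can be shown with a toy key-value store. This is not Berkeley DB's API or logfile format, and it has no checkpoints; the `TinyStore` class and its one-JSON-record-per-line log are inventions for this illustration. It only demonstrates that when every change is described on disk before it is applied, the state can be rebuilt by replaying the log, and a half-written final record is simply ignored.

```python
import json
import os
import tempfile


class TinyStore:
    """Toy key-value store that logs each change before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self.recover()

    def recover(self):
        """Rebuild state by replaying every complete record in the log."""
        self.data = {}
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                try:
                    key, value = json.loads(line)
                except ValueError:
                    break  # torn final record: that change never happened
                self.data[key] = value

    def put(self, key, value):
        # 1. Describe the modification in the on-disk log and flush it.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps([key, value]) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then make the modification itself.
        self.data[key] = value


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "changes.log")
    store = TinyStore(path)
    store.put("rev", 1)
    store.put("rev", 2)
    # Simulate a crash that left a half-written record behind.
    with open(path, "a", encoding="utf-8") as log:
        log.write('["rev", ')
    print(TinyStore(path).data)  # state recovered from the log alone
```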

But every rose has its thorn, and so we must note some known limitations of Berkeley DB. First, Berkeley DB environments are not portable. You cannot simply copy a Subversion repository that was created on a Unix system onto a Windows system and expect it to work. While much of the Berkeley DB database format is architecture-independent, there are other aspects of the environment that are not. Secondly, Subversion uses Berkeley DB in a way that will not operate on Windows 95/98 systems—if you need to house a BDB-backed repository on a Windows machine, stick with Windows 2000 or newer.

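Also, while Berkeley DB promises to behave correctly on network shares that meet a particular set of specifications, [28] most networked filesystem types and appliances do not actually meet those requirements. And in no case can you allow a BDB-backed repository that resides on a network share to be accessed by multiple clients of that share at once (which quite often is the whole point of having the repository live on a network share in the first place).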

Warning

If you attempt to use Berkeley DB on a noncompliant remote filesystem, the results are unpredictable—you may see mysterious errors right away, or it may be months before you discover that your repository database is subtly corrupted. You should strongly consider using the FSFS data store for repositories that need to live on a network share.

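Finally, because Berkeley DB is a library linked directly into Subversion, it's more sensitive to interruptions than a typical relational database system. Most SQL systems, for example, have a dedicated server process that mediates all access to tables. If a program accessing the database crashes for some reason, the database daemon notices the lost connection and cleans up any mess left behind. And because the database daemon is the only process accessing the tables, applications don't need to worry about permission conflicts. These things are not the case with Berkeley DB, however. Subversion (and programs using Subversion libraries) access the database tables directly, which means that a program crash can leave the database in a temporarily inconsistent, inaccessible state. When this happens, an administrator needs to ask Berkeley DB to restore to a checkpoint, which is a bit of an annoyance. Other things can cause a repository to “wedge” besides crashed processes, such as programs conflicting over ownership and permissions on the database files.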

Note

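Berkeley DB 4.4 brings (to Subversion 1.4 and later) the ability to automatically recover Berkeley DB environments in need of recovery. When a Subversion process detects unclean disconnections left behind by previous processes, it performs any necessary recovery and then continues as though nothing happened. This doesn't completely eliminate instances of repository wedging, but it does drastically reduce the amount of human interaction required to recover from them.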

So while a Berkeley DB repository is quite fast and scalable, it's best used by a single server process running as one user (such as Apache's httpd or svnserve; see Chapter 6, Server Configuration) rather than accessing it as many different users via file:// or svn+ssh:// URLs. If accessing a Berkeley DB repository directly as multiple users, be sure to read the section called “Supporting Multiple Repository Access Methods” later in this chapter.

FSFS

In mid-2004, a second type of repository storage system—one that doesn't use a database at all—came into being. An FSFS repository stores the changes associated with a revision in a single file, and so all of a repository's revisions can be found in a single subdirectory full of numbered files. Transactions are created in separate subdirectories as individual files. When complete, the transaction file is renamed and moved into the revisions directory, thus guaranteeing that commits are atomic. And because a revision file is permanent and unchanging, the repository also can be backed up while “hot,” just like a BDB-backed repository.
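The reason a rename gives atomic commits can be seen in a small sketch. This is not the actual FSFS on-disk format: the directory names, the plain-text revision contents, and the `commit` function are illustrative assumptions, and the sketch assumes a single committer, whereas a real implementation must serialize concurrent commits. The transaction is built as an ordinary file off to the side, and it becomes a revision only at the moment it is renamed into place, so readers see either the complete revision file or none at all.

```python
import os
import tempfile


def commit(repo_dir, changes):
    """Write 'changes' as the next numbered revision file, atomically."""
    revs_dir = os.path.join(repo_dir, "revs")
    txn_dir = os.path.join(repo_dir, "transactions")
    os.makedirs(revs_dir, exist_ok=True)
    os.makedirs(txn_dir, exist_ok=True)

    # Build the transaction as an ordinary file outside the revisions area.
    fd, txn_path = tempfile.mkstemp(dir=txn_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as txn:
        txn.write(changes)
        txn.flush()
        os.fsync(txn.fileno())

    # The next revision number is one more than the highest existing one.
    # (Assumes a single committer; no write lock is taken in this sketch.)
    existing = [int(name) for name in os.listdir(revs_dir)]
    new_rev = max(existing, default=0) + 1

    # Renaming within one filesystem is atomic: the revision file appears
    # complete or not at all.
    os.rename(txn_path, os.path.join(revs_dir, str(new_rev)))
    return new_rev


with tempfile.TemporaryDirectory() as repo:
    print(commit(repo, "add calc/trunk/main.c"))
    print(commit(repo, "edit calc/trunk/main.c"))
    print(sorted(os.listdir(os.path.join(repo, "revs"))))
```

If the process dies before the rename, the only trace is a leftover file in the transactions directory, which matches the observation that at worst some transaction data is left behind.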

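The revision file format represents a revision's directory structure, file contents, and deltas against files in other revision trees. Unlike a Berkeley DB database, this storage format is portable across different operating systems and isn't sensitive to CPU architecture. Because no journaling or shared-memory files are being used, the repository can be safely accessed over a network filesystem and examined in a read-only environment. The lack of database overhead also means the overall repository size is a bit smaller.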

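FSFS has different performance characteristics, too. When committing a directory with a huge number of files, FSFS is able to append directory entries more quickly. On the other hand, FSFS records each new version as a delta against an earlier version, which means that fetching the latest revision is a bit slower than with Berkeley DB. FSFS also has a longer delay when finalizing a commit, which in extreme cases can cause clients to time out while waiting for a response.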

The most important distinction, however, is FSFS's imperviousness to wedging when something goes wrong. If a process using a Berkeley DB database runs into a permissions problem or suddenly crashes, the database can be left in an unusable state until an administrator recovers it. If the same scenarios happen to a process using an FSFS repository, the repository isn't affected at all. At worst, some transaction data is left behind.

The only real argument against FSFS is its relative immaturity compared to Berkeley DB. Unlike Berkeley DB, which has years of history, its own dedicated development team, and, now, Oracle's mighty name attached to it, [29] FSFS is a newer bit of engineering. Prior to Subversion 1.4, it was still shaking out some pretty serious data integrity bugs, which, while only triggered in very rare cases, nonetheless did occur. That said, FSFS has quickly become the backend of choice for some of the largest public and private Subversion repositories, and it promises a lower barrier to entry for Subversion across the board.



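[25] Whether founded in ignorance or in poorly considered ideas about how to derive legitimate software development metrics, global revision numbers are a silly thing to fear, and not the kind of thing you should weigh when deciding how to arrange your projects and repositories.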

[26] The trunk, tags, and branches trio is sometimes referred to as “the TTB directories.”

[27] Often pronounced “fuzz-fuzz,” if Jack Repenning has anything to say about it. (This book, however, assumes that the reader is thinking “eff-ess-eff-ess.”)

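[28] Berkeley DB requires that the underlying filesystem implement strict POSIX locking semantics, and more importantly, the ability to map files directly into process memory.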

[29] Oracle bought Sleepycat and its flagship software, Berkeley DB, on Valentine's Day in 2006.