Doc File Format
 
Published: 2006-2-24 22:27:00

Note: this page has not been fully edited yet; please be patient.

The Doc format is the de facto standard for large text documents on the Palm Computing Platform. It enjoys wide support in both software and content, but documentation is sparse. This document is an attempt to describe the Doc format for the edification of programmers who are interested in writing Doc-compatible software, and to encourage programmers not to break the format in incompatible ways.

This document is totally unofficial, and derived from examination of existing Doc files and applications.

Overview

A Doc-format e-text is an ordinary PalmPilot database, represented on the desktop by a file in the standard .prc/.pdb format. (Describing that format is currently beyond the scope of this document.) The database is divided into three sections, which appear in order:

  • A header record
  • A series of text records
  • A series of bookmark records

Note that all values are stored MSB first, as is usual on the PalmPilot.

The Header Record

The first record in a Doc database is a header. Existing Doc creation programs create a 16-byte header, with contents as described below; many Doc readers extend this record once the database is installed, to hold additional reader-specific information.

Doc Header Format

Field        Size             Description
version      2 bytes          0x0002 if data is compressed, 0x0001 if uncompressed
spare        2 bytes          purpose unknown (set to 0 on creation)
length       4 bytes          total length of text before compression
records      2 bytes          number of text records
record_size  2 bytes          maximum size of each record (usually 4096; see below)
position     4 bytes          currently viewed position in the document
sizes        2*records bytes  record size array

The position field is not used by all readers; some store this information elsewhere.

AportisDoc (Reader and Mobile Edition) set spare to 0x0003, and overwrite the first two bytes of length with zeros (even if the document is more than 64k bytes in length!) upon first opening the document.

The sizes array is a list of two-byte unsigned integers giving the uncompressed size of each text record, in order. It is created by some readers (AportisDoc, TealDoc, Doc, and possibly others) when the document is first opened.
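The header layout described above can be read with a few lines of Python. This is only a sketch, not taken from any existing Doc tool: the names `DocHeader` and `parse_doc_header` are made up for illustration, and the code assumes that any bytes following the first 16 are the sizes array, which may not hold for readers that append other reader-specific data.

```python
import struct
from typing import NamedTuple, Tuple

# Big-endian ("MSB first") layout of the 16-byte header:
# version, spare, length, records, record_size, position.
HEADER_STRUCT = struct.Struct(">HHIHHI")


class DocHeader(NamedTuple):
    version: int
    spare: int
    length: int
    records: int
    record_size: int
    position: int
    sizes: Tuple[int, ...]  # empty if no sizes array was found


def parse_doc_header(record: bytes) -> DocHeader:
    """Parse the first record of a Doc database (hypothetical helper)."""
    if len(record) < HEADER_STRUCT.size:
        raise ValueError("header record is shorter than 16 bytes")
    fields = HEADER_STRUCT.unpack_from(record, 0)
    records = fields[3]
    # Assumption: if enough extra bytes are present, they hold the sizes
    # array (one 2-byte unsigned integer per text record).
    extra = record[HEADER_STRUCT.size:]
    sizes: Tuple[int, ...] = ()
    if records and len(extra) >= 2 * records:
        sizes = struct.unpack_from(">%dH" % records, extra, 0)
    return DocHeader(*fields, sizes)
```

A freshly created document would normally yield an empty `sizes` tuple, since the array is only added by some readers on first open.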

Text Records

Following the header record is a series of text records, each one of which represents a text block no greater than record_size bytes in length. Most conversion software creates blocks of 4096 bytes (except for the last one); the format provides for other block sizes and for records of varying lengths, but it is likely that some Doc-handling software cannot deal with anything but fixed 4096-byte records.

In a version 1 database, each block of text is simply stored in a single record. In a version 2 database, each block of text is individually compressed, making the actual record size somewhat smaller -- note that the block size refers to the uncompressed size of a text block.
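As a rough illustration of the block layout for an uncompressed (version 1) document, the text can be cut into fixed-size blocks as shown below. The function name `split_into_blocks` is hypothetical, and the 4096-byte default simply follows the common convention mentioned above.

```python
from typing import List


def split_into_blocks(text: bytes, record_size: int = 4096) -> List[bytes]:
    """Cut the document text into blocks of at most record_size bytes.

    Every block except possibly the last is exactly record_size bytes long.
    In a version 1 database each block would be stored as one text record.
    """
    if record_size <= 0:
        raise ValueError("record_size must be positive")
    return [text[i:i + record_size] for i in range(0, len(text), record_size)]
```

The number of blocks returned corresponds to the `records` field of the header, and `len(text)` to its `length` field.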

Compression Algorithm

Note: The original designer of the Doc compression format, Pat Beirne, has reposted one of his original messages describing the algorithm. If you are curious about why it works the way it does, check it out.

Each text block (in a version 2 database) is individually compressed using a simple one-pass algorithm. As I am far from an expert in compression algorithm design, I shall simply describe what the data looks like and refer anyone interested in more details to the code (which is readily available in a variety of places, such as in the source to txt2pdbdoc or the source to Pyrite).

The output of the compression algorithm is a stream of bytes, described here with the action taken by the decompressor when they are encountered:

 
Compression Byte Codes

Byte value  Action taken by the decompressor
0x00        Pass through as-is
0x01-0x08   Copy the following N bytes verbatim
0x09-0x7f   Pass through as-is
0x80-0xbf   Copy a sequence from a previous part of the block
0xc0-0xff   Insert a space followed by N xor 0x80

When a copy-sequence byte code is encountered, it is used as the high byte of a two byte quantity, along with the next byte in the data (resulting in a value from 0x8000-0xbfff). This value is then ANDed with 0x3fff, resulting in a value from 0x0000 to 0x3fff. It is further subdivided into an offset (the upper 11 bits, which are shifted down appropriately) and a length (the lower 3 bits). The actual data in the output is located by subtracting the offset from the current position in the decompressed data; the number of bytes copied is equal to the length plus 3.
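The decoding rules can be sketched in Python as follows. This is an illustrative decompressor, not the code from txt2pdbdoc or Pyrite, and the name `decompress_block` is made up. It treats 0x01-0x08 as count bytes and 0x00 and 0x09-0x7f as literals, which is how the commonly circulated implementations behave.

```python
def decompress_block(data: bytes) -> bytes:
    """Decompress one version 2 text record (illustrative sketch)."""
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        c = data[i]
        i += 1
        if 0x01 <= c <= 0x08:
            # Count byte: copy the following c bytes verbatim.
            out += data[i:i + c]
            i += c
        elif c <= 0x7F:
            # 0x00 and 0x09-0x7f: literal byte, passed through as-is.
            out.append(c)
        elif c >= 0xC0:
            # Space followed by the byte with its top bit cleared.
            out.append(0x20)
            out.append(c ^ 0x80)
        else:
            # 0x80-0xbf: two-byte copy-sequence code.
            if i >= n:
                raise ValueError("truncated copy-sequence code")
            pair = ((c << 8) | data[i]) & 0x3FFF
            i += 1
            offset = pair >> 3            # upper 11 bits
            length = (pair & 0x07) + 3    # lower 3 bits, plus 3
            if offset == 0 or offset > len(out):
                raise ValueError("copy offset points outside the block")
            start = len(out) - offset
            # Copy byte by byte so that overlapping sequences work.
            for k in range(length):
                out.append(out[start + k])
    return bytes(out)
```

For example, the six bytes `abc` followed by 0x80 0x18 encode an offset of 3 and a length field of 0, so they expand to `abcabc`.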

Bookmark Records

Following the text records is an optional series of bookmark records. Each bookmark occupies a single record, and they are usually presented by the reader in the same order they appear in the database. The format of a bookmark record is rather simple:

Field     Size      Description
name      16 bytes  bookmark name (up to 15 characters, null terminated)
position  4 bytes   bookmark position, from beginning of text

Note that the bookmark name field is always 16 bytes wide, even if the name is shorter, and that the position is in actual text bytes before compression.
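A bookmark record as described above is 20 bytes long and can be packed and unpacked as in this sketch. The helper names are hypothetical, and names are kept as raw bytes because the character encoding is not specified here.

```python
import struct
from typing import Tuple

# 16-byte name field followed by a 4-byte big-endian position.
BOOKMARK_STRUCT = struct.Struct(">16sI")


def pack_bookmark(name: bytes, position: int) -> bytes:
    """Build a bookmark record (hypothetical helper).

    The name is cut to 15 bytes so that the 16-byte field always ends
    with at least one null byte; struct pads shorter names with nulls.
    """
    return BOOKMARK_STRUCT.pack(name[:15], position)


def parse_bookmark(record: bytes) -> Tuple[bytes, int]:
    """Return (name, position) from a bookmark record."""
    if len(record) < BOOKMARK_STRUCT.size:
        raise ValueError("bookmark record is shorter than 20 bytes")
    name_field, position = BOOKMARK_STRUCT.unpack_from(record, 0)
    # Everything from the first null byte onward is padding.
    return name_field.split(b"\x00", 1)[0], position
```

The position refers to the uncompressed text, so it can be used directly as an index into the concatenated decompressed blocks.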

Common Conventions

Bookmark Autoscan

Because most Doc creation programs do not add bookmark records to their output, most Doc readers support an alternative method for authors to specify bookmark locations in a document. The reader scans the document the first time it is opened, looking for a specified string at the start of lines. Each time it is found, the reader adds a bookmark using the text on the rest of the line. By convention, the text to scan for is placed on the last line of the document, surrounded by angle brackets (< and >).
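The autoscan convention might be implemented roughly as below. This is a sketch under several assumptions that no particular reader is known to share: lines end with a newline byte, the bookmark position is the start of the matching line, names are trimmed of surrounding whitespace and cut to 15 bytes, and the final marker line itself is not scanned. The function name `autoscan_bookmarks` is made up.

```python
from typing import List, Tuple


def autoscan_bookmarks(text: bytes) -> List[Tuple[bytes, int]]:
    """Find (name, position) pairs using the <marker> on the last line."""
    stripped = text.rstrip()
    last_nl = stripped.rfind(b"\n")
    last_line = stripped[last_nl + 1:]
    # The last line must look like <marker> with a non-empty marker.
    if len(last_line) < 3 or not (
        last_line.startswith(b"<") and last_line.endswith(b">")
    ):
        return []
    marker = last_line[1:-1]
    body_end = last_nl + 1  # scan everything before the marker line
    bookmarks = []
    pos = 0
    while pos < body_end:
        eol = text.find(b"\n", pos, body_end)
        if eol == -1:
            eol = body_end
        line = text[pos:eol]
        if line.startswith(marker):
            # The rest of the line becomes the bookmark name.
            name = line[len(marker):].strip()[:15]
            bookmarks.append((name, pos))
        pos = eol + 1
    return bookmarks
```

With a document whose last line is `<*>`, every line beginning with `*` yields one bookmark.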

TealDoc-Specific Extensions

The current TealDoc extensions are implemented by the use of HTML-like tags embedded in the text of the document. Although TealDoc tags look like HTML, TealDoc's parser is not as robust as that of a desktop web browser; the following limitations have been observed in practice:

  • Tags, attributes, and keyword values must be in all upper case.
  • Each tag must appear alone on a single line; attempting to embed a tag in the middle of a line of text will cause unpredictable results.
  • Text attribute values should be surrounded by double quotes; keyword and numeric values should not be quoted.

Other Extensions

Besides TealDoc, other Doc readers also extend the standard e-text database format. Some of these extensions will be more fully documented later; for the time being, this section contains a few notes in the hopes that future developers will be able to avoid compatibility problems. Please note that the notes in this section should not be considered authoritative or complete; if you are developing Doc software, you should investigate this stuff for yourself.

QED Extensions

QED, the Doc editor from Visionary 2000, adds an appinfo block, simultaneously marking the document with its version number (in the database header).

RichReader Extensions

RichReader, the rich text document reader by Michael Arena, supports formatting control codes (font changes, indentation, etc.) embedded in the document text. When viewed on another reader, RichReader documents may appear to contain "garbage" characters, since many of the formatting codes use non-printable or extended ASCII characters.

LinkDoc Extensions

Mobile LinkDoc, a reader from Mobile Generation Software, stores links between documents by adding extended bookmark records to the document being linked from.

Extensions Which Do Not Affect the Doc Format

A number of readers (nearly all of them, in fact) store additional information in databases separate from the documents themselves, leaving the documents unaltered. For example, category information is normally stored externally. These product-specific databases will not, at the present time, be documented here, because they do not affect the document format itself.
