Electronic Records Management Guidelines - File Formats
Rapid changes in technology mean that file formats can become
obsolete quickly and cause problems for your records
management strategy. A long-term view and careful planning
can overcome this risk and ensure that you can meet
your legal and operational requirements.
Legally, your records must be authentic, complete, accessible,
legally admissible in court, and durable for as long
as your approved records retention schedules require.
Conversion is compatible with these requirements: you can
convert a record to another, more durable format (e.g.,
from the format of a nearly obsolete software program to
a text file), and that copy, as long as it is created in
a trustworthy manner, is legally acceptable.
The software in which a file is created usually has a default
format, often indicated by a file name suffix (e.g.,
*.PDF for Portable Document Format). Most software allows
authors to select from a variety of formats when they
save a file (e.g., document [DOC], Rich Text Format
[RTF], or text [TXT] in Microsoft Word). Some software,
such as Adobe Acrobat, is designed to convert files
from one format to another.
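As a rough illustration of how a file name suffix hints at a format, the sketch below looks up a few of the suffixes named above. The table `KNOWN_FORMATS` and the function `guess_format` are hypothetical names chosen for this example, and the list of suffixes is deliberately incomplete. A suffix is only a hint: a file can be renamed without its contents changing.

```python
from pathlib import PurePath

# Hypothetical lookup table covering only suffixes mentioned in these guidelines.
KNOWN_FORMATS = {
    ".pdf": "Portable Document Format",
    ".doc": "Microsoft Word document",
    ".rtf": "Rich Text Format",
    ".txt": "Plain text",
}

def guess_format(file_name: str) -> str:
    """Return a format name based on the suffix alone (a hint, not proof)."""
    suffix = PurePath(file_name).suffix.lower()  # ".PDF" and ".pdf" are treated alike
    return KNOWN_FORMATS.get(suffix, "Unknown format")
```

For example, `guess_format("Report.PDF")` returns the PDF entry, while a name with no suffix falls through to "Unknown format".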
Legal Framework: Key Concepts
As you consider the file format options available to
you, you will need to be familiar with the following
concepts:
- Proprietary and non-proprietary file formats
- File format types
- Preservation: conversion and migration
- Compression
- Importance of planning
- File format decisions and electronic records management goals
Proprietary and Non-proprietary File Formats
A file format is usually described as either proprietary
or non-proprietary:
Proprietary formats.
Proprietary file formats are controlled and supported
by just one software developer, or can only be read
by a limited number of other programs.
Non-proprietary formats.
These formats are supported by more than one developer
and can be accessed with different software systems.
For example, RTF files can be opened in many word processing
software programs. Extensible Markup Language (XML)
is an increasingly popular non-proprietary format.
File Format Types
Below are brief descriptions of the basic file format types
you are likely to encounter:
- Text files.
  Text files are most often created in word processing
  software programs. Common file formats for text files
  include:
  - Proprietary formats, such as Microsoft Word files and
    WordPerfect files, which carry the extension of the
    software in which they were created.
  - RTF files, which are non-proprietary text files
    saved with formatting instructions (such as page
    layout).
  - Portable Document Format (PDF) files, which contain an
    image of the page, including text and graphics.
    PDF files are widely used for read-only file sharing.
    PDF originated as an Adobe format and was long tied to
    Adobe software such as Acrobat; it has since been
    published as an open standard (ISO 32000), and software
    from many developers can create and read PDF files.
- Graphics files.
  Graphics files store an image (e.g., photograph, drawing)
  and are divided into two basic types:
  - Vector-based files, which store the image as geometric
    shapes described by mathematical formulas, allowing
    the image to be scaled without distortion.
    Common types of vector-based file formats include:
    - Drawing Interchange Format (DXF) files, which
      are widely used in computer-aided design software
      programs, such as those used by engineers
      and architects
    - Encapsulated PostScript (EPS) files, which are widely
      used in desktop publishing software programs
    - Computer Graphics Metafile (CGM) files, which are
      widely used in many image-oriented software programs
      (e.g., Photoshop) and offer a high degree
      of durability
    - Computer-Aided Design (CAD) files
  - Raster-based files, which store the image as a collection
    of pixels. Raster graphics are also referred to as
    bitmapped images. Raster graphics cannot be scaled
    without distortion. Common types of raster-based
    file formats include:
    - Tagged Image File Format (TIFF) files, which are
      widely usable in many different software programs
    - Bitmap (BMP) files, which are relatively low-quality
      files used most often in word processing applications
    - Graphics Interchange Format (GIF) files, which
      are widely used for Internet applications
    - Joint Photographic Experts Group (JPEG) files
- Data files.
  Data files are created in database software programs.
  Data files are divided into fields and tables
  that contain discrete elements of information.
  The software builds the relationships between
  these discrete elements. For example, a customer
  service database may contain customer name, address,
  and billing history fields. These fields may be
  organized into separate tables (e.g., one table
  for all customer name fields). You may convert
  data files to a text format, but you will lose
  the relationships among the fields and tables.
  For example, if you convert the information in
  the customer database to text, you may end up
  with ten pages of names, ten pages of addresses,
  and a thousand pages of billing information, with
  no indication of which information is related.
- Spreadsheet files.
  Spreadsheet files store the value of the numbers
  in their cells, as well as the relationships of
  those numbers. For example, one cell may contain
  the formula that sums two other cells. Like data
  files, spreadsheet files are most often in the
  proprietary format of the software program in
  which they were created. Some software programs
  can import and export data from other sources,
  including formats designed for such data sharing
  (e.g., Data Interchange Format [DIF]).
  Spreadsheet files can be exported as text files,
  but the cell formulas and the relationships among
  the numbers are lost.
- Video and audio files.
  These files contain moving images (e.g., digitized
  video, animation) and sound data. They are most
  often created and viewed in proprietary software
  programs and stored in proprietary formats. Common
  file formats in use include QuickTime and Moving
  Picture Experts Group (MPEG) formats.
- Markup languages.
  Markup languages, also called markup formats,
  contain embedded instructions for displaying or
  understanding the content of the file. The World
  Wide Web Consortium (W3C) (http://www.w3c.org/)
  supports these standards. Common markup language
  file formats include the following:
  - Standard Generalized Markup Language (SGML), a common
    markup language used in government offices
    worldwide, is an international standard. HTML
    and XML (below) are derived from SGML.
  - Hypertext Markup Language (HTML) is used to
    display most of the information on the World
    Wide Web.
  - Extensible Markup Language (XML) is a relatively simple
    language based on SGML that is gaining popularity
    for managing and sharing information.
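The data files entry above warns that converting a database to text can leave you with pages of names and pages of billing amounts and no way to connect them. The sketch below illustrates that warning with a hypothetical two-table customer database (the table contents, field names, and function names are all invented for this example). It also shows one possible mitigation, which is an assumption on our part and not something these guidelines prescribe: exporting each table as delimited text with its key field retained.

```python
import csv
import io

# Hypothetical two-table customer database, linked by customer_id.
customers = [
    {"customer_id": 1, "name": "A. Jones"},
    {"customer_id": 2, "name": "B. Smith"},
]
billing = [
    {"customer_id": 2, "amount": "40.00"},
    {"customer_id": 1, "amount": "15.50"},
]

def dump_values_only(rows, field):
    """Flatten one field to plain text; the link to the other table is gone."""
    return "\n".join(str(row[field]) for row in rows)

def export_table_csv(rows):
    """Export a table as delimited text, keeping every field including the key."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

names_text = dump_values_only(customers, "name")     # names, with no keys
amounts_text = dump_values_only(billing, "amount")   # amounts, with no keys
billing_csv = export_table_csv(billing)              # amounts with customer_id
```

In `names_text` and `amounts_text` nothing indicates which amount belongs to which customer. In `billing_csv` each amount still carries its `customer_id`, so the relationship can be rebuilt later.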
Table 1: Common File Formats
| File Format Type | Common Formats | Sample Files | Description |
| --- | --- | --- | --- |
| Text | PDF, RTF, TXT, proprietary formats based on software (e.g., Microsoft Word) | Letters, reports, memos, e-mail messages saved as text | Created or saved as text (may include graphics) |
| Vector graphics | DXF, EPS, CGM | Architectural plans, high-quality photographs, complex illustrations | Store the image as geometric shapes in a mathematical formula for undistorted scaling |
| Raster graphics | TIFF, BMP, GIF, JPEG | Medium-quality graphics for a web page, simple illustrations | Store the image as a collection of pixels, which cannot be scaled without distortion |
| Data file | Proprietary to software program | Human resources files, mailing lists | Created in database software programs |
| Spreadsheet file | Proprietary to software program, DIF | Financial analyses, statistical calculations | Store numerical values and calculations |
| Markup languages | SGML, HTML, XML | Text and graphics to be displayed on a web site | Contain embedded instructions for displaying and understanding the content of a file or multiple files |
| Video and audio files | QuickTime, MPEG | Short video to be shown on a web site, recorded interview to be shared on CD-ROM | Contain moving images and sound |
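Table 1 describes vector graphics as scaling without distortion and raster graphics as not. The toy sketch below shows why, under simplifying assumptions: the "vector" image is a single line stored as two endpoints, and the "raster" image is a 2 x 2 grid of pixel values enlarged by repeating pixels. The function names are hypothetical, and real graphics software uses far more sophisticated methods.

```python
def scale_vector_line(p0, p1, factor):
    """Scale a line stored as two endpoints; the result is still exact geometry."""
    return tuple(c * factor for c in p0), tuple(c * factor for c in p1)

def scale_raster_nearest(pixels, factor):
    """Enlarge a pixel grid by an integer factor by repeating each pixel."""
    return [
        [value for value in row for _ in range(factor)]
        for row in pixels
        for _ in range(factor)
    ]

# A diagonal line drawn as pixels: 1 = ink, 0 = background.
diagonal = [[1, 0],
            [0, 1]]

scaled_line = scale_vector_line((0, 0), (1, 1), 2)
enlarged = scale_raster_nearest(diagonal, 2)
```

The scaled vector line is still a perfectly straight line between two points. The enlarged raster turns each original pixel into a 2 x 2 block, so the diagonal becomes a visible staircase.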
Preservation: Conversion and Migration
Your most basic decision about file formats will be
whether you want to convert and/or migrate your records.
If you convert your records, you will change
their formats, perhaps to a software-independent format.
If you migrate your records, you will move them to another
platform or storage medium, without changing the file
format. However, you may need to convert records in
order to migrate them to ensure that they remain accessible.
For example, if you migrate records from a Macintosh
operating system to a Windows operating system, you
need to convert the records to a file format that is
accessible in a Windows operating system (e.g., RTF,
Word for Windows).
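The distinction can be sketched in code. In the hypothetical example below, `migrate` copies a record to new storage without changing a single byte and confirms this with a checksum, while `convert_mac_text_for_windows` changes how the content is stored. The conversion assumes a plain text file from an older Macintosh system that uses the Mac Roman character encoding and carriage-return line endings; that assumption, the function names, and the choice of SHA-256 are ours and are not part of these guidelines.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return a checksum that changes if any byte of the file changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def migrate(source: Path, target_dir: Path) -> Path:
    """Move a record to new storage without changing its format or content."""
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name
    shutil.copy2(source, target)  # copies the content and file timestamps
    if sha256_of(source) != sha256_of(target):
        raise ValueError(f"Copy of {source.name} does not match the original")
    return target

def convert_mac_text_for_windows(source: Path, target: Path) -> Path:
    """Convert assumed Mac Roman text with CR line endings to UTF-8 with CRLF."""
    text = source.read_bytes().decode("mac_roman")
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # normalize line endings
    target.write_bytes(text.replace("\n", "\r\n").encode("utf-8"))
    return target
```

After migration the checksums match, which is one simple way to document that a copy was created in a trustworthy manner. After conversion the checksums will differ by design, so you need other evidence (for example, a comparison of the decoded text) that no content was lost.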
As you determine your course of action, you will face
three basic types of potential loss:
- Data.
If you lose data, you lose, to a varying degree, the
content of the record. Bear in mind that, legally,
your records must be complete and trustworthy.
- Appearance.
You also risk loss of the structure of the record.
For example, if you convert all word processing documents
to RTF, you may lose some of the page layout. You
must determine if this loss affects the completeness
of the record. If the structure is essential to understanding
the record, this loss may be unacceptable.
- Relationships.
Another risk is the loss of the relationships of the
data in the file (e.g., spreadsheet cell formulas,
database file fields). Again, this loss may affect
the legal requirement for complete records.
Keep in mind that a copy of a record is legally admissible
only if it is created in a trustworthy manner and
is accurate, complete, and durable.
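Relationship loss is the hardest of the three to see, because the converted file can look correct on the day it is made. The sketch below uses a hypothetical in-memory spreadsheet (the cell names, the `("sum", ...)` formula notation, and the function names are invented for illustration) to show what a text export keeps and what it drops.

```python
# Hypothetical sheet: C1 holds a formula that sums A1 and B1, not a number.
sheet = {"A1": 2, "B1": 3, "C1": ("sum", "A1", "B1")}

def evaluate(sheet, cell):
    """Return the displayed value of a cell, following a sum formula if present."""
    content = sheet[cell]
    if isinstance(content, tuple) and content[0] == "sum":
        return sum(evaluate(sheet, ref) for ref in content[1:])
    return content

def export_as_text(sheet):
    """Keep only the displayed values as text, as a plain text export would."""
    return {cell: str(evaluate(sheet, cell)) for cell in sheet}

exported = export_as_text(sheet)

sheet["A1"] = 10                      # someone corrects an input figure
live_total = evaluate(sheet, "C1")    # the formula follows the correction
frozen_total = exported["C1"]         # the text export cannot
```

The export records the total as it stood at export time, but nothing in it says that the total was ever derived from the other two cells.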
Compression
As part of your strategy, you may choose to compress
your files. The pros and cons are summarized in Table
2 below.
Table 2: Pros and Cons of File Compression
| Pros | Cons |
| --- | --- |
| Saves storage space | May result in data loss |
| More quickly and easily transmittable | Introduces an additional layer of software dependency (the compression software) |
The greatest challenge in compressing files is that you
may lose data. Compression options vary in their degree
of data loss. Some are intentionally "lossy,"
such as the JPEG format, which relies on the human eye
to fill in the missing detail. Others are designed to
be "lossless." You may choose to compress
some files and not others.
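The difference between the two kinds of compression can be demonstrated in a few lines. The sketch below uses Python's standard `zlib` module for the lossless case. For the lossy case it uses a toy rounding step, which is only a stand-in to show information being discarded; it is not the JPEG algorithm. The sample record and sample values are invented.

```python
import zlib

# Lossless: the decompressed bytes are identical to the original.
record = b"Minutes of the board meeting. " * 50
compressed = zlib.compress(record)
restored = zlib.decompress(compressed)

def quantize(samples, step):
    """Toy lossy step: round each sample down to a multiple of `step`."""
    return [(s // step) * step for s in samples]

# Lossy: nearby values collapse together and cannot be told apart afterward.
samples = [12, 13, 14, 200, 201, 203]
lossy = quantize(samples, 8)
```

Here `restored` equals `record` exactly even though `compressed` is much smaller, while `lossy` keeps the rough shape of `samples` but the original values cannot be recovered from it.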
Importance of Planning
The challenges of preservation can be overcome with
good planning. Thoroughly discuss the issues raised
in the Key Issues to Consider section to weigh the
specific pros and cons of each option for your agency.
File Format Decisions and Electronic Records Management Goals
The goals of electronic records management that may
be affected by file format decisions include:
- Accessibility.
The file format must enable staff members and the
public to find and view the record. In other words,
you cannot convert the record to a format that is
highly compressed and easy to store, but inaccessible.
- Longevity.
Developers should support the file format long-term.
If the file format will not be supported long-term,
you risk having records that are not durable, because
the software to read or modify the file may not
be available.
- Accuracy.
If you convert your records, the file format you convert
to should result in records that have an acceptable
level of data, appearance, and relationship loss.
- Completeness.
If you convert your records, the file format you convert
to should meet your operational and legal objectives
for an acceptable degree of data, appearance, and relationship
loss.
- Flexibility.
The file format needs to meet your objectives for
sharing and using records. For example, you may need
to frequently share copies of the records with another
agency, use the records in your daily work, or convert
and/or migrate the records later. If the file format
can only be read by specialized hardware and/or software,
your ability to share, use, and manipulate the records
is limited.
Key Issues to Consider
Now that you are familiar with some of the basic concepts
of file formats, you can use the questions below to discuss
how those concepts relate to your agency. Pay special
attention to the questions posed by the legal framework,
including the need for public accessibility as appropriate,
completeness, trustworthiness, durability, and legal
admissibility. Consider the degree of acceptable data,
appearance, and relationship loss. Take a long-term
approach so that your file formats will meet your operational
and legal requirements now and in the future.
Discussion Questions
- What are our goals for electronic records management?
- How is our agency affected by the legal requirements?
- What current file formats do we use? Will the developer
  support these formats long-term?
- Are we planning on converting and/or migrating our records?
- What levels of data, appearance, and relationship loss
  are acceptable?
- What resources do we have for processing and maintaining
  records?
- How will our decisions affect other groups that may need
  current and future access to our records (e.g., other
  government agencies, the public)?
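To help answer the question about which file formats you currently use, a short script can count the file name suffixes found under a records folder. The sketch below is one possible starting point; the function name `inventory_formats` is hypothetical, and because it relies on suffixes alone it will miscount files that have been renamed or that carry no suffix.

```python
from collections import Counter
from pathlib import Path

def inventory_formats(root: Path) -> Counter:
    """Count files under `root` by lower-cased file name suffix."""
    counts = Counter()
    for path in root.rglob("*"):  # walks the folder and all subfolders
        if path.is_file():
            counts[path.suffix.lower() or "(no suffix)"] += 1
    return counts
```

Calling `inventory_formats(Path("some_folder")).most_common()` lists the suffixes from most to least frequent, which gives a discussion group a concrete list of formats to evaluate for long-term support.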
Definitions
lossless
A term describing a data compression algorithm which retains all
the information in the data, allowing it to be recovered
perfectly by decompression. GIF
and TIFF are examples of lossless techniques. (From
Dollar's Authentic Records)
lossy
A term describing a data compression algorithm which actually
reduces the amount of information in the data, rather
than just the number of bits used to represent that
information. The lost information is usually removed
because it is subjectively less important to the quality
of the data (usually an image or sound) or because
it can be recovered reasonably by interpolation from
the remaining data. MPEG
and JPEG are examples of lossy compression techniques.
The Public Records Division, Kentucky Department for Libraries
and Archives, expresses its appreciation to the State
Archives Department, Minnesota Historical Society, for
permission to adapt its document, Electronic Records
Management Guidelines: File Formats (State Archives
Department, Minnesota Historical Society, 2001), for
use in Kentucky.