File formats for long-term archiving

Repositories for all kinds of data usually require, or at least recommend, file formats suitable for long-term archiving. These recommendations are frequently incomplete, sometimes contradict each other, or do not appear entirely consistent to me. This is an attempt to start a discussion that might contribute to a consolidation of best practices. The following thoughts are likewise incomplete, subjective and inconsistent, and should only serve to get the discussion going.

Definition of long-term archiving

The data should still be usable when the current curators are no longer available, when the data might no longer be stored in the current repository, and at a time for which the available software and hardware are unknown today. Think ~100 years.

Criteria

A potential future user has to have the ability to implement from scratch the necessary software to interpret the data. Consequently,

1. proprietary (non-open) formats can be ruled out, and
2. open, but patent-encumbered formats should be avoided.
3. Simple format descriptions have precedence over more complicated ones.
4. Formats should be described by a reasonably established international standards organization such as W3C, IEC or ISO.

As a repository operator I want to enable the submitting users to directly submit data in acceptable formats. While automatic conversion would be frequently possible, this would raise issues regarding the responsibility for data quality and correctness. Therefore

5. The software to transform to the recommended format should be readily available, and the process should be easy to explain and to understand.
6. Ideally, the format should be directly supported by applications that a majority of users already employ to handle that data. Mainstream takes precedence over exotic.
7. Software to read the format should be available on all major platforms (operating systems) in use today. Interoperability is a benefit in itself, but the number of cross-platform implementations is also indicative of the perceived usefulness, actual use and technical sanity of the format.

In particular, criteria 5 and 6 (readily available conversion software and direct support in the applications most users already employ) seem not to be considered by most recommendations I have come across so far. Those criteria appear rather important to me when the aim of the repository includes making life easy for the data producers, who otherwise might choose not to submit anything in the first place.

The type of the file format needs to be communicated as well.

8. File formats that indicate their type in their own bytestream, e.g., through a particular start sequence of bytes, are preferable.
9. If the format has to be communicated in meta-data, that should be as simple as possible. It is a plus if a naming convention exists (e.g. through the file extension) that identifies the format as precisely as possible.
10. The identification of the file format should be as straightforward as possible, in particular to enable automatic detection.
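
To illustrate what detection from the bytestream could look like, here is a minimal sketch in Python. The signature table and the names `SIGNATURES`, `sniff_format` and `sniff_file` are my own illustration, not taken from any repository software, and the table only covers a few formats mentioned in this post.

```python
from typing import Optional

# Leading byte sequences ("magic numbers") of a few formats.
# Illustrative and deliberately incomplete.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF (DNG is TIFF-based and looks the same here)
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
    (b"\x00\x00\x00\x0cjP  \r\n\x87\n", "JPEG2000"),  # JP2 signature box
    (b"\xff\xd8\xff", "JPEG"),
    (b"PK\x03\x04", "ZIP container"),  # ODF and OOXML files are ZIP archives
]


def sniff_format(data: bytes) -> Optional[str]:
    """Return a format name if the data starts with a known signature."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only the first bytes of a file and sniff its format."""
    with open(path, "rb") as fh:
        return sniff_format(fh.read(16))
```

Note that plain text has no such signature, and that container formats (ZIP, TIFF) need a second look inside to find out what they actually carry.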

Further,

11. The promoted format should not encourage or enforce a conversion that loses information.
12. The format should allow for automatic validation.

Raw text

Everybody agrees that this should be encoded as ASCII or UTF-8 or UTF-16 or UTF-32 or ISO 8859-1. It would be good to settle on one. UTF-8 has the dominant position today: it is required by many internet protocols, it is the required encoding for many higher-level formats such as markup languages, it has been standardized by several organizations, and it is generally ubiquitous. Also, ASCII is a subset of UTF-8.

UTF-8 seems to have a number of other advantages: there are no byte-order ambiguities, and it can be detected simply by checking whether the data decodes correctly as UTF-8 (http://stackoverflow.com/a/4522251). According to IANUS the BOM should be omitted, and that seems to make sense to me. Often this is not mentioned, though.
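
A minimal sketch of such a check in Python, under the assumption that a repository would want to test both points at submission time. The function name `check_utf8` is made up for this example.

```python
import codecs
from typing import Tuple


def check_utf8(data: bytes) -> Tuple[bool, bool]:
    """Return (is_valid_utf8, has_bom) for the raw bytes of a text file."""
    # The UTF-8 BOM is the byte sequence EF BB BF at the very start.
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        # Decoding is strict by default: any invalid byte sequence raises.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False, has_bom
    return True, has_bom
```

Text in a legacy 8-bit encoding that contains non-ASCII characters will almost always fail the strict decode, which is what makes this test useful. Pure ASCII passes, as it should.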

Why accept anything else at all? I suggest discouraging the use of other encodings.

Text with layout

This can, quite loosely, include the description of a page (margins, size, etc.), style elements such as fonts and colors, graphical elements, etc.

PDF/A

Everybody agrees that PDF/A is good for this. Which PDF/A?

Presumably PDF/A-1 and PDF/A-2 with their respective levels a and b, as well as PDF/A-2u, are good.

Clearly, PDF/A-3 is not good, because it allows embedding of arbitrary other formats.
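
A PDF/A file declares its part and conformance level in its XMP metadata, through the properties `pdfaid:part` and `pdfaid:conformance`. The following is a rough sketch of how a repository could read that declaration and compare it against the variants listed above. It assumes the XMP packet is stored uncompressed in the file, and it only reads what the file claims about itself; it does not validate anything. The names `claimed_pdfa`, `acceptable` and `ACCEPTED` are made up for this example.

```python
import re
from typing import Optional

# Match both XMP spellings: attribute (pdfaid:part="1")
# and element (<pdfaid:part>1</pdfaid:part>).
_PART = re.compile(rb"""pdfaid:part\s*(?:=\s*["']|>)\s*(\d)""")
_CONF = re.compile(rb"""pdfaid:conformance\s*(?:=\s*["']|>)\s*([ABUabu])""")

# The variants presumed to be fine in this post; PDF/A-3 is left out on purpose.
ACCEPTED = {"PDF/A-1a", "PDF/A-1b", "PDF/A-2a", "PDF/A-2b", "PDF/A-2u"}


def claimed_pdfa(data: bytes) -> Optional[str]:
    """Return the PDF/A variant a file claims to be, e.g. 'PDF/A-2u', or None."""
    part = _PART.search(data)
    if part is None:
        return None
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return "PDF/A-" + part.group(1).decode("ascii") + level


def acceptable(data: bytes) -> bool:
    """True if the claimed variant is in the ACCEPTED set."""
    return claimed_pdfa(data) in ACCEPTED
```

Whether the file actually conforms to what it claims is a separate question that needs a real PDF/A validator.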

Just stating that PDF/A is OK might lead people to submit bitmaps embedded in PDF. Is that OK? Or should a bitmap format be promoted for such cases?

Office Document Standards

There are two: OOXML and ODF. In general such formats are very complicated because they are designed to describe at least the output of word processors, presentation software, and spreadsheet software. They are generally less robust than formats with fewer features, are relatively young, and new versions are developed at a comparatively fast rate.

Therefore, these formats should be acceptable only as a last resort, in case the data can’t be represented in a more suitable format, e.g. word-processing documents and presentations as PDF/A and spreadsheets as plain-text tables (e.g. CSV).

Office Open XML (OOXML)

OOXML describes file formats for “office documents” and is almost exclusively used as the default format for Microsoft Office output (.docx, .pptx, .xlsx). In many recommendations these formats are marked as acceptable for long-term archiving; presumably their ISO standardization is taken as an indicator of suitability. This judgment eludes me, since these formats currently do not even provide a robust form of day-to-day storage. The criticism of these formats is abundant. Here are the most important points, in my opinion:

Actually there are three standards for OOXML which are incompatible with each other and are implemented to varying degrees in currently available software (ECMA-376, ISO/IEC 29500 Transitional, and ISO/IEC 29500 Strict).
OOXML is much more complicated in structure than the competing ODF, and the size of the standard description (6546 pages) compares unfavorably with that of ODF (867 pages). And this doesn’t include the 1500 pages necessary to describe the difference between ISO/IEC 29500 Transitional and ISO/IEC 29500 Strict.
Only a few software packages exist today that implement some variety of OOXML, and it is not immediately clear to the user which version is actually written when she clicks on “save as docx”, for example.
Apparently there is no or only broken versioning (inclusion of a pointer to the version of the standard of the document) in OOXML, which precludes automatic validation, see here and here (German). Apart from that, the complexity of the standards alone apparently did not allow for the development of validation tools.
OOXML seems to allow embedding various external media formats and ActiveX controls.

There is an in-depth analysis of the problems with OOXML by Markus Feilner and another overview by Markus Dapp in German.

The integrity of the ISO standardization process has been questioned and Wikipedia has a lengthy entry about that.

Every IT group supporting a heterogeneous infrastructure will sing long and sad songs about the interoperability problems created by OOXML. Long-term archives should be kept strictly clean of OOXML, and so should even intra-institutional medium-term archives.

Open Document Format for Office Applications (ODF)

ODF is a comparatively sane standard that is maintained and developed by a broad industry consortium (OASIS). A large number of software packages across all major platforms support it. ODF should be used if storing an “office format” is absolutely necessary at all. (Are there cases where this is so?)
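
Both office formats are ZIP containers, so a file extension or a ZIP signature alone does not tell them apart. As a sketch of how a repository could distinguish them at submission time: ODF packages carry a member named `mimetype` with a media type starting with `application/vnd.oasis.opendocument.`, while OOXML packages carry a member named `[Content_Types].xml`. The function name `office_family` is made up for this example, and the OOXML test is only a heuristic, since other packaging formats use the same member name.

```python
import io
import zipfile
from typing import Optional


def office_family(data: bytes) -> Optional[str]:
    """Return 'ODF', 'OOXML' or None for the raw bytes of a file."""
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        if "mimetype" in names:
            mime = zf.read("mimetype").decode("ascii", "replace").strip()
            if mime.startswith("application/vnd.oasis.opendocument."):
                return "ODF"
        if "[Content_Types].xml" in names:
            # Heuristic: this member marks an Open Packaging Conventions
            # package, which is most likely (but not necessarily) OOXML.
            return "OOXML"
    return None
```

This says nothing about which of the three OOXML variants a file uses, which is exactly the versioning problem described above.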

Still Image

This is either JPEG2000 or TIFF or DNG or PNG or GIF. (GIF is sometimes mistakenly described as using lossy compression.)

For JPEG2000 and TIFF there is a zoo of substandards that is hard to disentangle. The matter is complicated by the fact that at least TIFF can serve as a container that contains, for example, PNG. Both JPEG2000 and TIFF also allow for lossy compression even though that appears rather unpopular.

My ad-hoc recommendation would be to keep any of the above formats if the original data to be submitted comes in such a format. A conversion could result in loss of information, even among losslessly compressed formats, due to different meta-data capabilities or different capabilities of color representation. Usually there is advice against using GIF because its compression algorithm (LZW) is patent-encumbered. According to Wikipedia though, these patents expired long ago, making this a non-issue.

JPEG is mostly used for lossy compressed images and conversion from lossless formats must be avoided. However, if the original data is in JPEG there seems to be no reason not to archive it like that.

In future installments I am planning to write about

Tabular Data
Vector Graphics
Audio
Video
Geospatial Data

Resources

http://www.library.ethz.ch/en/Media/Files/File-formats-for-archiving

http://www.digitalpreservation.gov/formats/index.shtml

http://www.ianus-fdz.de/it-empfehlungen/dateiformate

http://www.preforma-project.eu/index.html

selecting-file-formats.pdf

Claire October 4, 2016 / 12:25 am

so glad you are looking thinking about this with reference to long term data management. we are currently also working on a ckan instance for scientific data management, and trying to integrate an entire lifecycle flow from set up a data management plan to data upload in ckan and github.
Cheers

Claire


ckan4rdm October 7, 2016 / 11:38 am

Hi Claire,

thanks for the feedback! Maybe you want to put a little blurb about your project in the Project-Section (https://ckan4rdm.wordpress.com/projects/), so that at one point we might have available a collection of research management projects using CKAN.

Cheers,
Harald
