Programs and file formats change over time, so old files may become difficult to read. This complicates the use of digital information over the long term.
In this manual the ETH Library recommends suitable file formats for archiving your data. We also explain how to convert your files into suitable formats, and how you can use the software DROID to track down unsuitable files even in large data collections.
Our recommendations apply to long-term preservation of digital publications and research data in general and are not a prerequisite for submitting data to the Research Collection.
1. Assessment of various file formats
Table 1: Our assessment of future readability of some common file formats. Please note that we cannot guarantee usability over these time periods.
Suitable for use over more than ten years
Use limited to ten years
Not suitable for archiving
- PDF/A (*.pdf)
- Plain Text (*.txt, *.asc, *.c, *.h, *.cpp, *.m, *.py etc.) coded as ASCII, UTF-8, or UTF-16 using byte order mark
- XML (including XSD/XSL/XHTML etc.; with the schema included or accessible and the character encoding explicitly specified)
PDF (*.pdf) with embedded fonts
Plain text (*.txt, *.asc, *.c, *.h, *.cpp, *.m, *.py etc.) (ISO 8859-1 coded)
Rich Text Format (*.rtf)
HTML (include a DOCTYPE declaration)
LaTeX, TeX (The ASCII text is readable over long term; open source software required for formatting should be included.)
HTML and XML (The ASCII text is readable over long term; try to avoid external links.)
OpenDocument formats (*.odm, *.odt, *.odg, *.odc, *.odf)
- Word *.doc
- PowerPoint *.ppt
Spreadsheet or table
- Comma- or tab-delimited text files (*.csv)
- Excel *.xlsx (container format)
- OpenDocument spreadsheets (*.ods)
- Excel *.xls, *.xlsb (binary formats)
- Text files for S-Plus (*.sdd). The ASCII text is suitable for long-term use, but importing the data into the workspace is not.
- Matlab *.mat files should be saved in HDF format, as nontrivial ASCII Matlab *.mat files are not readable with the Matlab load command (see table 2).
- Network Common Data Format or NetCDF (*.nc, *.cdf)
- Binary files such as the standard Matlab files *.mat or the R files *.RData
- TIFF (*.tif) (uncompressed, preferably TIFF 6.0, Part 1: baseline TIFF). TIFF is preferred over PNG or JPEG2000.
- PNG (uncompressed)
JPEG2000 (lossless compression)
- TIFF (*.tif) (compressed)
- GIF (*.gif)
- BMP (*.bmp)
- JPEG/JFIF (*.jpg)
- JPEG2000 (lossy compression) (*.jp2)
Graphics
InDesign (*.indd), Illustrator (*.ait)
Encapsulated Postscript (EPS)
- AutoCAD Drawing (*.dwg)
- Drawing Interchange Format, AutoCAD (*.dxf)
- Extensible 3D, X3D (*.x3d, *.x3dv, *.x3db)
- WAV (*.wav) (uncompressed, pulse-code modulated)
- Advanced Audio Coding (*.mp4)
- MP3 (*.mp3)
- Motion JPEG 2000 (ISO/IEC 15444-4) (*.mj2)
- AVI (uncompressed, motion JPEG) (*.avi)
- QuickTime Movie (uncompressed, motion JPEG) (*.mov)
- MPEG-1, MPEG-2 (*.mpg, *.mpeg, wrapped into the container format AVI or MOV)
- MPEG-4 (H.263, H.264) (*.mp4, wrapped into the container format AVI or MOV)
- Windows Media Video (*.wmv)
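Table 1 distinguishes plain text by its character encoding (ASCII, UTF-8, UTF-16 with byte order mark, or ISO 8859-1). The following is a minimal Python sketch of how the encoding of a text file could be roughly classified. The function name classify_text_encoding is hypothetical, and the result is only a heuristic: bytes that are not valid UTF-8 may be ISO 8859-1, but they may also be some other 8-bit encoding.

```python
import codecs


def classify_text_encoding(data: bytes) -> str:
    """Return a rough label for the encoding of raw text bytes."""
    # UTF-16 is only recommended with a byte order mark, so look for one first.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16 with BOM"
    try:
        data.decode("ascii")
        return "ASCII"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "UTF-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: an 8-bit encoding such as ISO 8859-1 is likely,
        # but this cannot be determined from the bytes alone.
        return "other (possibly ISO 8859-1)"
```

To check a file, read it in binary mode and pass the bytes to the function, for example classify_text_encoding(open("notes.txt", "rb").read()), where notes.txt is a placeholder name.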
1.1 Use limited to ten years
If you plan to use your data for ten years or less, we recommend the formats in the middle and left columns of Table 1. Even less well-known formats that are common in your field for this type of data are usually suitable.
You should also consider the following points:
- Files in rare formats should be converted into common formats whenever possible. You should archive the original file and the converted file.
- The files should not be dependent on references to data, templates, fonts, or programs stored elsewhere, but instead, such objects should be archived too. If this is not possible, you should describe the existing dependencies on other files or programs in a plain text file ("readme"). You then archive the readme file together with the data.
- Files should not be password protected, encrypted or compressed. If you absolutely need to encrypt data, please configure access rights such that the data can still be opened after your departure.
- Use only letters, numbers, underscore (_) and hyphen (-) for naming folders and files. Avoid spaces, slashes, and other special characters. For more information see this guideline.
- The file extension should be consistent with the actual file format.
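The naming rule above can be checked automatically before archiving. The following Python sketch is one possible way to do this, under two assumptions that are not part of the recommendation itself: "letters" is read as the unaccented letters A to Z, and dots are allowed only to separate file extensions. The names is_safe_name and find_unsafe_names are hypothetical.

```python
import os
import re

# Letters, digits, underscore and hyphen, optionally followed by one or more
# dot-separated extensions (assumption: only ASCII letters count as letters).
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)*")


def is_safe_name(name: str) -> bool:
    """Return True if a single file or folder name follows the naming rule."""
    return SAFE_NAME.fullmatch(name) is not None


def find_unsafe_names(root: str) -> list:
    """Walk a directory tree and return the paths whose names break the rule."""
    unsafe = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_safe_name(name):
                unsafe.append(os.path.join(dirpath, name))
    return sorted(unsafe)
```

Calling find_unsafe_names on the top-level folder of a data collection returns a sorted list of the paths that should be renamed.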
1.2 Use for more than ten years
To use files for more than ten years, you should follow the recommendations given above. Furthermore, the file formats should be very common and, if possible, follow standards that are open and not proprietary. However, it cannot be guaranteed that your data will remain readable over the long term, as this depends on future software developments.
For storage over more than ten years, we recommend the file formats in the left column of Table 1, such as PDF/A, ASCII text, and TIFF. Bear in mind that the future readability of a file will also strongly depend on which features of the format are used: advanced features, such as video data within a PDF file, will be read less reliably than basic features.
For more detailed information, see the recommendations of the Bundesarchiv (German), the KOST (German or French), the Forschungsdatenzentrum Archäologie & Altertumswissenschaften IANUS (German), the Library of Congress and the Harvard Library. The table in Rimkus et al., 2014 summarizes the recommendations of many archives.
The ETH Library will review the archived file formats regularly and will attempt to convert outdated formats into more common formats. The original file will always be kept.
2. Recommended conversion methods
We recommend the conversion methods shown in Table 2. Which conversions are useful also depends on the type of information stored in the files. You may store your Excel spreadsheets in *.csv files, but if the Excel file also contains macros, formulas or embedded objects, this information will be lost.
You should check the quality of your converted files. The original and the converted files should be archived.
Some more recent file types (*.docx, *.xlsx, *.pptx) are so-called container files. By appending the file extension “.zip” to the file name, you can open the container and inspect its individual components. You may also save these simpler component files separately.
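Because these container files are ZIP archives, their components can also be listed with a short script instead of renaming the file. The following is a minimal sketch using Python's standard zipfile module; the function name list_container_parts is hypothetical.

```python
import zipfile


def list_container_parts(path):
    """Return the names of the files stored inside a ZIP-based container."""
    # zipfile reads the archive by its content, so the file does not need
    # to carry the ".zip" extension.
    with zipfile.ZipFile(path) as container:
        return container.namelist()
```

For example, list_container_parts("report.docx") would return the names of the XML and media parts inside a Word file, where report.docx is a placeholder name.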
Table 2: Recommended file conversions
- Word and PowerPoint files should be converted to PDF/A-1b files. According to our tests, the following method for converting Microsoft Word or Microsoft PowerPoint files usually leads to acceptable results: Open the file using Word or PowerPoint, click “File” and then “Print” to open the Print dialog box. Select the printer “Adobe PDF”. Click on “Printer Properties” and there select “PDF/A-1b: 2005 (RGB)”. Then click on the button “Print”. See also the instructions on Creating PDF Files.
- LaTeX or TeX files should be converted to PDF/A files.
- You should carefully check the quality of your converted files. Verify equations, special characters, special fonts, spelling errors, searching and selecting of text, tables, colours, transparent objects, comments, vector graphics and layered graphics.
- Convert Excel *.xls files to *.xlsx files
- You may save a copy of embedded objects (such as figures) as independent files.
- Tables may be converted to ASCII text *.csv files: In Excel you may save sheets as *.csv files; in R you may save tables with “write.csv”; and in S-Plus you may use “write.table” to save as *.sdd files.
- Matlab *.mat files should be saved as v7.3 files (using save -v7.3 x.mat), as the resulting *.mat file follows an HDF5-based standard. (HDF5 is an open standard for tables, media data and complex data structures.)
- The R workspace should be saved in HDF5 format using the R package rhdf5. The S-Plus function data.dump produces a file that can be read with the R function data.restore.
- Saving the workspace as ASCII is not useful for complex data, as the resulting files are hard to access. (Such an ASCII workspace dump can be produced using save(…, ascii = TRUE) in R, the command save file.txt -ascii in Matlab, or dump() in S-Plus.)
- If there are important tables in the workspace, a copy can be saved as a CSV file.
- Vector graphics files will be harder to access over the long term than bitmaps. Embedding vector graphics in PDF files is not safe either. Files in special vector graphic formats, such as InDesign (*.indd) or Illustrator (*.ait), should also be saved as baseline TIFF, PDF/A-1b (see above), SVG or JPG files. You should carefully check the quality of the resulting files regarding contrast, resolution, colours, transparent objects, and text.
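The advice to export tables as CSV and to check the converted files can be combined in a small script. The sketch below uses Python's standard csv module rather than the Excel, R or S-Plus routes named in Table 2, so it is an illustration of the idea and not one of the recommended methods. It assumes that all cells are strings or numbers, and the function names are hypothetical.

```python
import csv


def write_table_csv(rows, path):
    """Write a list of rows to a UTF-8 encoded CSV file."""
    # newline="" lets the csv module control the line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)


def read_table_csv(path):
    """Read a UTF-8 encoded CSV file back as a list of rows of strings."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle)]


def roundtrip_ok(rows, path):
    """Write the table, read it back, and compare it with the original."""
    write_table_csv(rows, path)
    # CSV stores text only, so compare against the string form of each cell.
    expected = [[str(cell) for cell in row] for row in rows]
    return read_table_csv(path) == expected
```

A result of False from roundtrip_ok indicates that the CSV copy does not reproduce the original cells and should be inspected before archiving.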
3. File format verification with DROID
For large data collections you can get an overview of your file formats using the free Java application DROID. Furthermore, this tool detects unknown file formats as well as inconsistencies between file extensions and file contents (figure 1).
With the exception of text files, files usually contain a special string of characters to indicate the file format. This character string is also referred to as signature or as magic numbers. If DROID finds a known signature within the file, this is used to determine the file type. In this case "Signature" or "Container" is indicated in the column "Method" (see figure 1). If the signature within the file is not consistent with the file extension, DROID shows a warning sign (yellow triangle with exclamation mark).
Pure text files (*.txt) or tables in text format (*.csv files) do not contain any signatures. DROID classifies such files by using the file extension. If there is no signature and the file extension does not indicate a text file, the file is not classified at all (both files at the bottom of figure 1).
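The idea of signature-based identification can be illustrated with a few well-known signatures. The following Python sketch is a simplified illustration only and does not reproduce how DROID works internally. The signature list is deliberately short, and the names SIGNATURES and identify are hypothetical.

```python
# Each entry: (leading bytes of the file, format name, matching extensions).
SIGNATURES = [
    (b"%PDF-", "PDF", {".pdf"}),
    (b"\x89PNG\r\n\x1a\n", "PNG", {".png"}),
    (b"GIF87a", "GIF", {".gif"}),
    (b"GIF89a", "GIF", {".gif"}),
    (b"II*\x00", "TIFF", {".tif", ".tiff"}),  # little-endian TIFF
    (b"MM\x00*", "TIFF", {".tif", ".tiff"}),  # big-endian TIFF
    (b"\xff\xd8\xff", "JPEG", {".jpg", ".jpeg"}),
]


def identify(header: bytes, extension: str):
    """Return (format name or None, True if the extension fits the signature)."""
    for magic, name, extensions in SIGNATURES:
        if header.startswith(magic):
            return name, extension.lower() in extensions
    # No known signature: text files and unknown formats end up here.
    return None, False
```

To apply this to a file, read its first bytes in binary mode and pass them to identify together with the file extension. A result such as ("PNG", False) corresponds to the mismatch case for which DROID shows a warning sign.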
The software tool docuteam packer is recommended and set up for some customers by the ETH Library. This tool detects files with unclear or unknown formats and produces a list comparable to that of DROID.
Figure 1: Screenshot showing DROID verification for some test files. Files with unclear or unknown file types can be easily detected.