The following recommendations apply when files are uploaded manually using the web interface of the ETH Data Archive (http://data-archive.ethz.ch/deposit).
The first section of this document explains how you may prepare your files and folders to ensure long-term readability of your data.
For archiving large collections of heterogeneous research data sets over a limited time period, we currently recommend packing the data into container formats. The second section of this document explains how to create ZIP or tar containers and recommends suitable tools.
1. Data preparation
We recommend selecting the data carefully so that the archived data is of scientific relevance and worth archiving over the long term. Please remove unneeded data and avoid storing identical files in several places, for example ZIP files together with their unzipped contents, multiple backups, or temporary files. Private information does not belong in the ETH Data Archive.
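As an aid for spotting identical files before archiving, the following is a minimal sketch in Python. It is an illustration only and not a tool provided by the ETH Data Archive; the function name find_duplicate_files is hypothetical, and the sketch assumes that files with the same SHA-256 checksum can be treated as identical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(root):
    """Group files under `root` that have identical contents."""
    by_digest = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(path)
    # Keep only the groups that contain more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Each returned group lists files with the same content. Deciding which copy to keep remains a manual step.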
Choose open formats
To allow for long-term readability of your files, prefer non-proprietary file formats that follow open and properly documented standards. If you plan to archive your data for more than 10 years, we recommend converting unusual file formats into more popular formats. Please consult the fact sheet File Formats for Archiving for further information on this topic.
Avoid special characters
Avoid special characters in names of files and folders. These characters hamper compatibility because they lead to undesired effects that depend on the operating system.
Avoid the following characters:
- \ / ? : * " > < |
These characters are not allowed in Windows file names. If a folder is unpacked by WinZip, these characters are usually replaced by underscores.
- Non-ASCII characters, such as ¢ ™ ® ä ö ü à é ô and other characters with diacritics
If files with such names are packed with WinZip, they are moved to locations outside of their original folder due to a flaw in Linux.
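The two groups of characters listed above can be checked automatically. The following Python sketch is one possible way to do this; the names WINDOWS_RESERVED, name_problems and scan_names are hypothetical, and treating every code point above 127 as non-ASCII is the only rule applied.

```python
from pathlib import Path

# Characters listed above as not allowed in Windows file names.
WINDOWS_RESERVED = set('\\/?:*"><|')


def name_problems(name):
    """Return a list of reasons why `name` may cause trouble, or [] if none."""
    problems = []
    reserved = sorted(set(name) & WINDOWS_RESERVED)
    if reserved:
        problems.append("reserved characters: " + " ".join(reserved))
    non_ascii = sorted({ch for ch in name if ord(ch) > 127})
    if non_ascii:
        problems.append("non-ASCII characters: " + " ".join(non_ascii))
    return problems


def scan_names(root):
    """Map each file or folder under `root` to its list of name problems."""
    report = {}
    for path in Path(root).rglob("*"):
        problems = name_problems(path.name)
        if problems:
            report[path] = problems
    return report
```

Calling scan_names on the top folder of a data set returns only the entries that need renaming. An empty result means no listed character was found.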
The following ASCII characters are permitted:
Proper use of file extensions
File extensions (such as .txt, .pdf) should be consistent with the file format. Avoid saving files without file extensions or using special characters in the file extension.
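Missing or unusual file extensions can be found with a short script. The Python sketch below is an assumption-laden illustration: the function name extension_problems is hypothetical, and the check only looks at the extension itself. It does not verify that the extension matches the actual file format.

```python
from pathlib import Path


def extension_problems(root):
    """List files under `root` whose extension is missing or unusual."""
    flagged = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix  # e.g. ".txt"; empty string if there is none
        if suffix in ("", "."):
            flagged.append((path, "no file extension"))
        elif not (suffix[1:].isascii() and suffix[1:].isalnum()):
            # Anything other than ASCII letters and digits is reported.
            flagged.append((path, "special characters in extension"))
    return flagged
```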
Limit the lengths of file and folder names
Avoid overly long paths in your folder structure. Long file names combined with a detailed folder hierarchy may lead to path lengths exceeding 256 characters, which causes issues for Windows users (see footnote 1). Such containers cannot be completely unpacked with WinZip. Effective path lengths increase further when special characters are used in file names and when container files are unpacked within subfolders. We thus recommend keeping path lengths below roughly 200 characters.
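The recommended limit of roughly 200 characters can be checked before packing. The following Python sketch measures paths relative to the top folder of the data set; the names MAX_RELATIVE_PATH and long_paths are hypothetical, and the sketch assumes that the relative path is what ends up inside the container.

```python
from pathlib import Path

MAX_RELATIVE_PATH = 200  # rough limit recommended above


def long_paths(root, limit=MAX_RELATIVE_PATH):
    """Return (length, relative path) for entries whose path exceeds `limit`."""
    root = Path(root)
    too_long = []
    for path in root.rglob("*"):
        relative = path.relative_to(root).as_posix()
        if len(relative) > limit:
            too_long.append((len(relative), relative))
    # Longest paths first, so the worst offenders are easy to spot.
    return sorted(too_long, reverse=True)
```

Because the folder into which a container is later unpacked adds to the path length, a lower value for limit may be appropriate.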
2. How to package data into ZIP or tar archives
We currently recommend packing the data into ZIP or tar container files in order to archive large collections of heterogeneous research data sets in the ETH Data Archive (without active validation and preservation measures) over a limited time period. Using container files has the advantage that all files in an archival package are uploaded (and downloaded) in a single batch. Furthermore, the folder structure remains unchanged.
Even when using file containers, we strongly recommend preparing the data as described in the first section of this document. The data should be carefully selected and the contents should be documented. Furthermore, the file formats used should still be readable in 10 or 15 years.
Limit the length of file and folder names
Please consider that the original folder structure may need to be recovered from the container files in various operating systems. Therefore avoid overly long path lengths when organizing your data. Path lengths exceeding 256 characters hamper further processing in Windows, and WinZip fails to unpack such containers. See also the recommendations described in section 1.
Split large data packages
Large data sets can cause difficulties both when uploading your data and when downloading them using the viewer. We have no influence on several factors that cause these difficulties (such as the browser and the internet connection). Uploading data packages of up to 15 GB is possible, but downloading packages of this size with a browser is usually not feasible. Therefore, we recommend using ZIP or tar files that do not exceed a maximum size of 2 GB. If your archival package exceeds this size, please split it into meaningful subunits and use one ZIP or tar container for each subunit. You will then be able to upload all your container files in a single batch.
Please do not use the split feature of WinZip when splitting your data!
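To decide where to split a package, it can help to know how large each top-level subfolder is. The Python sketch below reports these sizes and flags entries above the 2 GB limit. The names MAX_CONTAINER_BYTES, subunit_sizes and oversized_subunits are hypothetical, and 2 GB is taken here as 2 × 1024³ bytes, which is an assumption. Choosing meaningful subunits remains a manual decision.

```python
from pathlib import Path

MAX_CONTAINER_BYTES = 2 * 1024**3  # 2 GB, the recommended upper limit


def subunit_sizes(root):
    """Return {top-level entry name: total size in bytes} for `root`."""
    sizes = {}
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir():
            sizes[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
        else:
            sizes[entry.name] = entry.stat().st_size
    return sizes


def oversized_subunits(root, limit=MAX_CONTAINER_BYTES):
    """Names of top-level entries that are too large for one container."""
    return [name for name, size in subunit_sizes(root).items() if size > limit]
```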
General comments on creating container files
- Only use archives with the extensions .zip or .tar (do not use .7z, .tar.gz, .rar, and so on).
- If you create ZIP archives, please zip your data without any data compression.
- Avoid encrypting your files.
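A ZIP container that follows the points above (extension .zip, no compression, no encryption) can be created with Python's standard zipfile module, as in the following sketch. The function name make_zip_stored is hypothetical. The sketch assumes that the ZIP file is written to a location outside the folder being packed.

```python
import zipfile
from pathlib import Path


def make_zip_stored(source_dir, zip_path):
    """Pack `source_dir` into `zip_path` without compressing the data."""
    source_dir = Path(source_dir).resolve()
    # ZIP_STORED writes the bytes as-is; allowZip64 permits large members.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED,
                         allowZip64=True) as archive:
        for path in sorted(source_dir.rglob("*")):
            # Store paths relative to the parent so the top folder is kept.
            archive.write(path, path.relative_to(source_dir.parent).as_posix())
```

For example, make_zip_stored("experiment_01", "experiment_01.zip") would pack a folder named experiment_01 (a made-up name) including its subfolders.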
Container formats and suitable software tools
We suggest selecting a format for your container files that is convenient to create on your operating system. On a Windows operating system you may generate ZIP containers whereas Mac OS users usually prefer creating tar containers.
The tar format is preferred for long-term archiving because it is an openly-documented format that does not depend on a single producer.
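Where no graphical tool is at hand, a plain tar container can also be written with Python's standard tarfile module. The following is a minimal sketch; the function names make_tar and list_tar are hypothetical. Mode "w" is used so that the result is an uncompressed .tar file rather than a .tar.gz file.

```python
import tarfile
from pathlib import Path


def make_tar(source_dir, tar_path):
    """Pack `source_dir` into an uncompressed tar container at `tar_path`."""
    source_dir = Path(source_dir).resolve()
    # Mode "w" writes a plain tar file; "w:gz" would produce .tar.gz instead.
    with tarfile.open(str(tar_path), "w") as archive:
        # arcname keeps the top-level folder name but drops the absolute path.
        archive.add(str(source_dir), arcname=source_dir.name)


def list_tar(tar_path):
    """Return the member names, e.g. to verify the folder structure."""
    with tarfile.open(str(tar_path), "r") as archive:
        return archive.getnames()
```

Listing the members after creation is a simple way to confirm that the folder structure was preserved.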
1 For file names, the length is limited by most operating systems to less than 256 characters.
2 Download is free of charge at http://7-zip.org/ (accessed 03.03.2015). Please contact your IT support.
3 Download is free of charge at http://www.kekaosx.com/en/ (accessed 03.03.2015). Please contact your IT support.