• The Research Collection is the Institutional Repository of ETH Zurich and a free service for ETH members.
  • It is aligned with the FAIR data principles according to SNSF guidelines (https://www.snf.ch/en/WtezJ6qxuTRnSYgF/topic/open-research-data-which-data-repositories-can-be-used).
  • The repository is accepted for publishing supplemental material by renowned scientific journals.
  • It supports linking the data to other publications in the Research Collection (Journal Articles, Conference Papers, ...).
  • It supports Web upload, DOI-reservation and registration, ORCID iDs and export to OpenAire.
  • Entries are preserved long term in the ETH Data Archive.
  • All types of data can be published in the Research Collection (taking into account any ethical and legal restrictions).
  • The Research Collection offers the following publication types for research data: Data Collection, Dataset, Image, Model, Software, Sound, Video and Other Research Data.
  • We do not recommend uploading files larger than 10 GB in the Research Collection. Each entry shouldn't exceed a total of 50 GB. Several entries can be linked to a parent Data Collection.
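The size recommendations above can be checked before upload with a short script. The following is a minimal sketch and not an official ETH Library tool: the function name `check_entry_size` and the constants are hypothetical, and it assumes decimal gigabytes (10^9 bytes), since this page does not say whether GB or GiB is meant.

```python
from pathlib import Path

GB = 1000 ** 3             # assumption: decimal gigabytes
MAX_FILE_BYTES = 10 * GB   # recommended upper size for a single file
MAX_ENTRY_BYTES = 50 * GB  # recommended upper size for a whole entry


def check_entry_size(entry_dir, max_file=MAX_FILE_BYTES, max_entry=MAX_ENTRY_BYTES):
    """Return (total_bytes, oversized_files, entry_too_large) for a folder."""
    sizes = {p: p.stat().st_size for p in Path(entry_dir).rglob("*") if p.is_file()}
    total = sum(sizes.values())
    # Files that individually exceed the per-file recommendation.
    oversized = sorted(str(p) for p, size in sizes.items() if size > max_file)
    return total, oversized, total > max_entry
```

Calling `check_entry_size("my_dataset")` on the folder you intend to upload returns the total size, the list of files above the per-file recommendation, and a flag telling you whether the folder should be split into several entries linked to a parent Data Collection.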

How to prepare your research data for publication


  • Only research data to which you hold the rights or have consent from the rights holders can be published in the Research Collection
  • Research data has to comply with the standards set out in ETH Zurich’s Compliance Guide and the Guidelines for Research Integrity at ETH Zurich
  • All software (code) including scripts that is going to be published in the Research Collection has to be registered with ETH transfer and licensed under an open source license. The license has to be included in the upload. For more information consult Licensing open source software and scripts


Contacts: 

  • We recommend choosing a Creative Commons licence. These licences allow authors / data producers to define what types of reuse are permitted for their works. Data published without a licence can only be reused with explicit permission from the copyright holder or based on an exemption in national copyright law. Works published with a CC licence can be reused as specified in the licence.
  • For help in choosing a Creative Commons license check the page Creative Commons Licenses.
  • Creative Commons Licences are inappropriate for Software and open source licences are unsuited for research data. For licensing software visit "Licensing open source software and scripts"
  • Include a README file to document your data. A guide can be found here: https://documentation.library.ethz.ch/x/bQBIB
  • Use meaningful file and folder names
  • Include metadata
  • Link your item to an article or other publication (example)
  • Remove temporary and backup files
  • Remove duplicate files
  • Remove personal information
  • Rename files and folders where helpful (meaningful names)
    • Avoid overly long folder and file names. Total path lengths >200 characters (files and folders combined) can lead to problems for Windows users
  • Remove third party files and software for which you don't have permission
  • Check for hardcoded file paths, symbolic links, references
  • Don't include your manuscript: publisher's PDFs, preprints and postprints (Author's Accepted Manuscripts) should be published as a separate entry. See Self-archiving
  • Spell check your text files
  • File extensions should be consistent with file formats 
  • Avoid special characters in names of files and folders. These characters hamper compatibility because they lead to undesired effects depending on the operating system
    • Avoid the following characters:
      • \ / ? : * " > < | # % { } ^ [ ] ` ~ as well as blanks
      • Non ASCII characters such as ¢ ™ ® , umlauts (ä ö ü), diacritics such as à é ô etc.
    • The following ASCII characters are permitted:
      • ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789
    • We are currently not aware of problems with the following characters:
      • ! $ & ' ( ) + , - . ; = @ _ 
  • Choose open, well documented standards from your domain
  • For long term preservation (> 15 years) only formats from our list of recommended file formats should be used. Other formats should be converted. Consult our list of File formats for archiving research data - draft
  • When non-recommended formats are used, long term readability cannot be guaranteed
  • Upload single files directly
  • Pack file collections containing a large number of files or subfolders into a container
    • Use standard container formats: ZIP or TAR files (avoid .7z, .tar.gz, .rar, and so on)
    • Use preferably uncompressed containers (compression level “store”)
    • Don't use encryption or password protection
  • Large datasets can lead to problems with up- and downloads. Limit single files to around 10 GB and each entry to a total of 50 GB.
    • Please split larger folder structures manually into meaningful subunits and package them separately. Don't use the automatic split features of your software.
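The naming rules in the list above can be checked automatically before packaging. The sketch below is only an illustration: `check_names`, `SAFE_CHARS` and `MAX_PATH_LENGTH` are hypothetical names. It treats the permitted ASCII letters and digits plus the characters listed as unproblematic as safe and reports everything else. Path length is measured relative to the folder you pass in, so the full path on another user's machine may be longer.

```python
import string
from pathlib import Path

# Permitted characters plus those listed above as not known to cause problems.
SAFE_CHARS = set(string.ascii_letters + string.digits + "!$&'()+,-.;=@_")
MAX_PATH_LENGTH = 200  # folders and file name combined


def check_names(root, max_path_length=MAX_PATH_LENGTH):
    """Yield (relative_path, problem) for every file or folder below root."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        # Any character outside the safe set (blanks, umlauts, # % ...) is reported.
        bad = sorted(set(path.name) - SAFE_CHARS)
        if bad:
            yield str(rel), "characters to avoid: " + " ".join(repr(c) for c in bad)
        if len(rel.as_posix()) > max_path_length:
            yield str(rel), f"path longer than {max_path_length} characters"
```

Running `for path, problem in check_names("my_dataset"): print(path, problem)` prints one line per finding; an empty result means no naming issue was detected by this simple check.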

Instructions for Windows

On Windows operating systems, you should create ZIP archives by using the software tool 7-Zip.

You start by selecting your files and folders in Windows. Then, click your right mouse button to open a menu. Select the software tool “7-Zip”, and “Add to archive …” A dialog box opens as shown below. In the white field on top, you write the name of your archive file. Use the option “zip” and the compression level “store”.
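For scripted workflows, the same settings (format "zip", compression level "store", no password) can be approximated with Python's standard `zipfile` module. This is a sketch under that assumption and not part of the 7-Zip instructions above; the function name `make_stored_zip` is hypothetical. Empty folders are not added in this simple version, and the archive should be written outside the folder being packed.

```python
import zipfile
from pathlib import Path


def make_stored_zip(folder, archive_path):
    """Pack folder into a ZIP archive without compression or encryption."""
    folder = Path(folder)
    # ZIP_STORED writes the files uncompressed ("store").
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Keep the top-level folder name inside the archive.
                arcname = (Path(folder.name) / path.relative_to(folder)).as_posix()
                zf.write(path, arcname=arcname)
    return archive_path
```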

Instructions for macOS

On a Macintosh computer, you should create tar archives.

You may create tar container files either by using the command line (tar -cvf <archive_name.tar> <folder_to_tar>) or by using the software Keka. If you choose the second option, you start the program Keka and select “Compression” in “Preferences”. You then select the default format “TAR” as in the dialog box shown below. You finally drag your folder onto the Keka icon and fill in the name of your archival file. You should select the option "Exclude Mac resource forks (e.g. .DS_Store)" in the Keka settings. This prevents the inclusion of macOS specific hidden files. 
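As an alternative to the command line or Keka, an uncompressed tar container can also be written with Python's standard `tarfile` module. The sketch below is an illustration only; the function names are hypothetical. It skips `.DS_Store` files and AppleDouble files (names starting with `._`), which is an assumption about what the Keka option excludes.

```python
import tarfile
from pathlib import Path


def _skip_macos_metadata(tarinfo):
    """Drop .DS_Store and AppleDouble (._*) entries; keep everything else."""
    name = Path(tarinfo.name).name
    if name == ".DS_Store" or name.startswith("._"):
        return None  # returning None excludes the entry from the archive
    return tarinfo


def make_plain_tar(folder, archive_path):
    """Pack folder into an uncompressed .tar archive."""
    folder = Path(folder)
    with tarfile.open(archive_path, "w") as tar:  # mode "w": no compression
        tar.add(str(folder), arcname=folder.name, filter=_skip_macos_metadata)
    return archive_path
```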

How to upload your data in the Research Collection




  • If you would like to reserve a DOI before uploading your data follow the instructions on Reserving a DOI
  • The Research Collection offers different access rights for research data items: Open Access, Closed Access, Embargoed Access, ETHZ Users only and Selected Users only. For more information check our page Access rights
  • After you finish your submission your item is still in a review state: Metadata is visible but files are not accessible
  • Within a few work days the data is reviewed by our team.
    • We check and supplement the entry's metadata
    • We identify and add the used formats to the metadata.
    • We inform users of issues with long term preservation (depending on the selected retention period)
    • We perform a cursory check for obvious legal issues (copyrighted material, software licenses) → Submitters are responsible for the contents of their upload!
    • There is no content review
    • After the review you will be informed of the finalisation of your item
    • The DOI will be registered overnight


  • Research output from cooperation projects can be deposited in the Research Collection independent of where it was produced, as long as an ETH group or institute takes over the responsibility for obtaining the publication rights from the data producers.
  • Please consider the following when uploading the data:
    • Field Organisational Unit: Data that was not produced at ETH must use the organisational unit of the ETH group or institute that was part of the cooperation.
    • Field ETH Publication: Data that was not produced at ETH Zurich must be marked with a "No".
    • Field Project/Grant: Please choose the ID/name of the cooperation project / grant, in order to prevent unnecessary enquiries from our staff about the origin of the data.
  • The ETH Library forwards access requests to the rights holder of a dataset - even if he/she is no longer affiliated with ETH Zurich - under the condition that the rights holder has an up-to-date ORCID record and his/her ORCID iD has been assigned to the corresponding author record in the Research Collection.
  • In addition, you might want to consider assigning the task of a “data steward” to someone in your group. This person can approve access requests for data (even if the data producer has already left the group) and edit data that has not yet been published. If you would like to determine a data steward for your group, please contact research-collection@library.ethz.ch.

File formats for archiving research data - draft

1. Assessment of various file formats

Table 1: Our assessment of future readability of some common file formats. (For more detailed information we refer to the recommendations of the Bundesarchiv (German), the KOST (German or French), the Memoriav, the Forschungsdatenzentrums Archäologie & Altertumswissenschaften IANUS (Germany), the Library of Congress and the Harvard Library.)

File type

Recommended

Suitable to only a limited extent

Not suitable for archiving

Text
  • PDF/A (*.pdf)
  • Plain Text (*.txt, *.asc, *.c, *.h, *.cpp, *.m, *.py, *.r etc.) coded as ASCII, UTF-8, or UTF-16 using byte order mark
  • XML (including XSD/XSL/XHTML etc.; with included or accessible schema and character encoding explicitly specified)
  • PDF (*.pdf) with embedded fonts

  • Plain text (*.txt, *.asc, *.c, *.h, *.cpp, *.m, *.py, *.r etc.) (ISO 8859-1 coded)

  • Rich Text Format (*.rtf)

  • HTML and XML (The ASCII text is readable over long term; try to avoid external links.)

Not accepted for publication, OK for supplementary materials:

  • Word *.docx

  • PowerPoint *.pptx

  • LaTeX, TeX (The ASCII text is readable over long term; open source software required for formatting and the resulting PDF should be included.)

  • OpenDocument formats (*.odm, *.odt, *.odg, *.odc, *.odf)

  • Word *.doc
  • PowerPoint *.ppt

Spreadsheet or table

  • Comma- or tab delimited text files (*.csv)
  • Excel *.xlsx (container format)
  • OpenDocument spreadsheets (*.ods)
  • Excel *.xls, *.xlsb (binary formats)
Raw data and workspace
  • ASCII Text is suitable for long-term use, but the data import may be time-consuming.
  • S-Plus files (*.sdd) may be saved as text files.
  • Matlab *.mat files may be saved in HDF Format. Saving nontrivial ASCII Matlab *.mat files should be avoided because they are not readable with the Matlab load command (see table 2).
  • Network Common Data Format or NetCDF (*.nc, *.cdf)
  • Hierarchical Data Format (HDF5) (*.h5, *.hdf5, *.he5)
  • Binary files such as the standard Matlab files *.mat or the R files *.RData
Raster image (bitmap)
  • TIFF (*.tif) (uncompressed, preferentially TIFF 6.0, Part 1: baseline TIFF). TIFF is preferred as compared to PNG or JPEG2000.
  • Portable Network Graphics (*.png, uncompressed)
  • JPEG2000 (*.jp2, lossless compression)

  • Digital-Negative-Format (*.dng) to keep raw data of digital photos in addition to a second copy in TIFF format
  • TIFF (*.tif) (compressed)
  • GIF (*.gif)
  • BMP (*.bmp)
  • JPEG/JFIF (*.jpg)
  • JPEG2000 (lossy compression) (*.jp2)

Vector graphics
  • SVG without JavaScript binding (*.svg)

  • Graphics InDesign (*.indd), Illustrator (*.ait)

  • Encapsulated Postscript (*.eps)

  • Photoshop (*.psd)
CAD
  • AutoCAD Drawing (*.dwg)
  • Drawing Interchange Format, AutoCAD (*.dxf)
  • Extensible 3D, X3D (*.x3d, *.x3dv, *.x3db)


Audio
  • WAV (*.wav) (uncompressed, pulse-code modulated)
  • Advanced Audio Coding (*.mp4)
  • MP3 (*.mp3)

Video 1)
  • FFV1 codec (version 3 or later) in Matroska container (*.mkv)
  • MPEG-2 (*.mpg,*.mpeg)
  • MP4, which is also called MPEG-4 Part 14 (*.mp4)
  • QuickTime Movie (*.mov) 2)
  • Audio Video Interleave (*.avi)
  • Motion JPEG 2000 (*.mj2, *.mjp2)
  • Windows Media Video (*.wmv)

1) In addition to the file format (or container format), also the codec and the compression method are important. See Ianus, Memoriav and KOST for further information.

2) In the version of this document dated Nov 21, 2018, the format QuickTime Movie was downgraded from “Recommended” to “Suitable to only a limited extent”. Apple discontinued support for QuickTime Player on Windows in 2016. Windows Media Player thus only supports file format versions 2.0, or earlier, of QuickTime Movie files.

1.1 Suitable to only a limited extent

If you plan to use your data for up to ten years, we recommend the formats in the middle and the left column of Table 1. Even lesser-known formats that are common in your area of expertise for this type of data are usually suitable.

You should also consider the following points:

  • Files in rare formats should be converted into common formats whenever possible. You should archive the original file and the converted file.
  • The files should not be dependent on references to data, templates, fonts, or programs stored elsewhere, but instead, such objects should be archived too. If this is not possible, you should describe the existing dependencies on other files or programs in a plain text file ("readme"). You then archive the readme file together with the data.
  • Files should not be password protected, encrypted or compressed. If you absolutely need to encrypt data, please configure access rights such that data can still be opened after your departure.
  • Use only letters, numbers, underscore (_) and hyphen (-) for naming folders and files. Avoid spaces, slashes, and other special characters. For more information see this guideline.
  • The file extension should be consistent with the actual file format.
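When many files need renaming, a helper can propose names that follow the stricter rule above (letters, numbers, underscore and hyphen only). This is a hypothetical sketch, not a library tool: `suggest_safe_name` strips accents by Unicode decomposition (so "ä" becomes "a", not "ae"), replaces runs of other characters with a single underscore, and keeps only the last dot-separated part as the file extension. Review the suggestions before renaming anything.

```python
import re
import unicodedata


def suggest_safe_name(name):
    """Suggest a name using only letters, digits, underscore and hyphen."""
    stem, dot, ext = name.rpartition(".")
    if not dot or not stem:  # no extension (or a dotfile): whole name is the stem
        stem, ext = name, ""
    # Decompose accented letters (é -> e + accent) and drop the non-ASCII parts.
    ascii_stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode("ascii")
    # Replace every run of remaining disallowed characters with one underscore.
    safe_stem = re.sub(r"[^A-Za-z0-9_-]+", "_", ascii_stem).strip("_")
    return f"{safe_stem}.{ext}" if ext else safe_stem


print(suggest_safe_name("Messung März 2021 (final).csv"))
# prints: Messung_Marz_2021_final.csv
```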

1.2 Recommended file formats

For storage over more than ten years, we recommend file formats in the left column of Table 1, such as PDF/A, ASCII text, and TIFF. Also PNG, SVG and JPEG2000 may be appropriate. Bear in mind that the future readability of a file will also strongly depend on the used file features: Reading fancy features of a format, such as video data within a PDF file, will be less reliable than reading basic features.

To use files for more than ten years, the file formats should be very common and, if possible, follow standards that are open and not proprietary. Nevertheless, it cannot be guaranteed that your data will remain readable over the long term, as this depends on future software developments.

The ETH Library will review the archived file formats regularly and will attempt to convert outdated formats into more common formats. The original file will always be kept.

2. Recommended conversion methods

We recommend the conversion methods shown in Table 2. Useful conversions also depend on the type of information that is stored in the files. You may store your Excel spreadsheets in *.csv files, but if the Excel file also contains macros, equations or embedded objects, this information will be lost.

You should check the quality of your converted files. The original and the converted files should be archived.

Some more recent file types (*.docx, *.xlsx, *.pptx) are so-called container files. By attaching the file extension “.zip” to the file name you can check the single file components. You may also save such simpler files separately.
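Instead of renaming the file to “.zip”, the components of such a container can also be listed directly, because Python's standard `zipfile` module opens any ZIP-based file regardless of its extension. The function name `list_container_parts` below is hypothetical and the snippet is only a sketch.

```python
import zipfile


def list_container_parts(path):
    """Return the names of the files inside a *.docx, *.xlsx or *.pptx container."""
    if not zipfile.is_zipfile(path):
        raise ValueError(f"{path} is not a ZIP-based container")
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()
```

For a Word document the returned list typically contains XML parts and, if present, embedded media files, which can then be extracted with `ZipFile.extract` and saved separately.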

Table 2: Recommended file conversions

File type

Recommended conversions

Text
  • Word and PowerPoint files should be converted to PDF/A-1b files. According to our tests the following method for converting Microsoft Word or Microsoft PowerPoint files usually leads to acceptable results: Open the file using Word or PowerPoint, click “File” and then “Print” to open the Print dialog box. Select the printer “Adobe PDF”. Click on “Printer Properties” and there select “PDF/A-1b: 2005 (RGB)”. Then click on the button “Print”. See also the instruction on Creating PDF Files.
  • LaTeX (or TeX files) should be converted to PDF/A files and both versions should be submitted.
  • You should carefully check the quality of your converted files. Verify equations, special characters, umlauts, special fonts, spelling errors, searching and selecting of text, tables, colours, transparent objects, comments, vector graphics and layered graphics.
Tables
  • Convert Excel *.xls files to *.xlsx files
  • You may save a copy of embedded objects (such as figures) as independent files.
  • Tables may be converted to ASCII text *.csv files: In Excel you may save sheets as *.csv files; in R you may save tables with “write.csv”; and in S-Plus you may use “write.table” to save as *.sdd files.
Workspace Dump in Matlab, R or S-Plus
  • Matlab *.mat files should be saved as v7.3 files (using save -v7.3 x.mat), as the produced *.mat file follows an HDF5-based standard. (HDF5 is an open standard for tables, media data and complex data structures.)
  • The R workspace should be saved with the R package rhdf5 in HDF5 format. The S-Plus function data.dump produces a file that can be read with the R function data.restore.
  • Saving the workspace using ASCII is not useful for complex data, as the produced files are hard to access. (One can save such an ASCII workspace dump using save(…, ascii = TRUE) in R, using the command save file.txt -ascii in Matlab, or using dump() in S-Plus)
  • If there are important tables in the workspace, a copy can be saved as CSV-file.
Graphics
  • Vector graphics files will be harder to access over the long term than bitmaps. Embedding of vector graphics into PDF files is not safe either. Files in special vector graphic formats, such as InDesign (*.indd) or Illustrator (*.ait), should also be saved as baseline TIFF, PDF/A-1b (see above), SVG or JPG files. You should carefully check the quality of produced files regarding contrast, resolution, colours, transparent objects, and text.


3. File format verification with DROID

For large data collections you can get an overview of your file formats using the free Java application DROID. Furthermore, this tool detects unknown file formats as well as inconsistencies between file extensions and file contents (figure 1).

With the exception of text files, files usually contain a special string of characters to indicate the file format. This character string is also referred to as signature or as magic numbers. If DROID finds a known signature within the file, this is used to determine the file type. In this case "Signature" or "Container" is indicated in the column "Method" (see figure 1). If the signature within the file is not consistent with the file extension, DROID shows a warning sign (yellow triangle with exclamation mark).

Pure text files (*.txt) or tables in text format (*.csv files) do not contain any signatures. DROID classifies such files by using the file extension. If there is no signature and the file extension does not indicate a text file, the file is not classified at all (both files at the bottom of figure 1).
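The idea behind signature-based identification can be illustrated with a few lines of Python. This is a strongly simplified sketch and not how DROID is implemented: DROID relies on a much larger signature registry and reports ZIP-based formats with the method "Container". The signature table, the names `SIGNATURES` and `classify`, and the returned labels are assumptions made for this example.

```python
from pathlib import Path

# A few well-known signatures (magic numbers) and extensions they usually go with.
SIGNATURES = [
    (b"%PDF-", {".pdf"}),
    (b"\x89PNG\r\n\x1a\n", {".png"}),
    (b"GIF87a", {".gif"}),
    (b"GIF89a", {".gif"}),
    (b"II*\x00", {".tif", ".tiff", ".dng"}),  # little-endian TIFF
    (b"MM\x00*", {".tif", ".tiff", ".dng"}),  # big-endian TIFF
    (b"PK\x03\x04", {".zip", ".docx", ".xlsx", ".pptx", ".odt", ".ods"}),
]
TEXT_EXTENSIONS = {".txt", ".csv"}  # text files carry no signature


def classify(path):
    """Return (method, extension_is_consistent) for one file."""
    path = Path(path)
    ext = path.suffix.lower()
    with open(path, "rb") as fh:
        head = fh.read(16)
    for magic, extensions in SIGNATURES:
        if head.startswith(magic):
            # Known signature found: compare it with the file extension.
            return "Signature", ext in extensions
    if ext in TEXT_EXTENSIONS:
        return "Extension", True
    return "Unclassified", False
```

A PNG image that was renamed to `figure.pdf` would be reported as `("Signature", False)`, which corresponds to the warning sign DROID shows for inconsistent extensions.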

The software tool docuteam packer is recommended and set up for some customers by the ETH Library. This tool detects files with unclear or unknown formats and produces a list comparable to that of DROID.


Figure 1: Screenshot showing DROID verification for some test files. Files with unclear or unknown file types can be easily detected.



