Thursday, April 21, 2016

Archiving raster data: choosing a raster and compression file format

I'm going to be generating a large number of raster datasets in an upcoming project, and I want to archive the data so that it is as openly accessible as possible. As well as choosing a place to store the data (a future post on this methinks), I need to decide on the file formats that I use. There seem to be two file format considerations here: the raster file format and the compression file format.

In the past I have been unable to make use of shared data because it was stored in a file format belonging to software that had become obsolete. Therefore, I have developed a preference for simple plain text file formats for data archiving. Plain text files may not be the most efficient file format for actually analysing the data, but when archived in plain text, the data can always be accessed and converted into something more suitable for an individual workflow.

The Esri ASCII raster (.asc) format is a non-proprietary plain text way of storing raster data. I've been using .asc files for years as they are supported by lots of GIS programs such as ArcGIS, QGIS, and IDRISI, and can be easily imported into Python and R. A downside is that .asc files do not contain coordinate system information, so .asc files do need to be accompanied by supporting information that describes the coordinate system used.
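To illustrate how little is needed to read the format, here is a minimal sketch of a plain-Python .asc reader. It assumes the usual header layout (ncols, nrows, xllcorner, yllcorner, cellsize and an optional NODATA_value, one keyword per line) followed by the cell values in row order. The function name read_asc and the sample grid are made up for this example, and in practice a GIS library would do the job more robustly.

```python
import io

# A tiny made-up raster in Esri ASCII layout (2 rows x 3 columns).
SAMPLE = """ncols 3
nrows 2
xllcorner 0.0
yllcorner 0.0
cellsize 10.0
NODATA_value -9999
1 2 3
4 -9999 6
"""


def read_asc(stream):
    """Read an Esri ASCII raster from a text stream into (header, rows)."""
    header = {}
    values = []
    for line in stream:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # Header lines are "keyword value" pairs that come before any cell values.
        if not values and len(parts) == 2 and parts[0][0].isalpha():
            header[parts[0].lower()] = float(parts[1])
        else:
            values.extend(float(p) for p in parts)

    ncols = int(header["ncols"])
    nrows = int(header["nrows"])
    if len(values) != ncols * nrows:
        raise ValueError("number of cell values does not match ncols * nrows")

    # Cell values are stored row by row, starting with the top row of the grid.
    rows = [values[r * ncols:(r + 1) * ncols] for r in range(nrows)]
    return header, rows


header, rows = read_asc(io.StringIO(SAMPLE))
print(header["cellsize"], rows)
```

Note that nothing in the header describes the coordinate system, which is why that information has to travel alongside the file.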

As I'm concerned about archiving a large number of raster datasets, file compression is also going to be important. The archived files need to be compressed as much as possible so that the requirements for both storage space and file transfer can be reduced. Having looked around on the web for advice, I found a wide variety of views as to which file compression format is best. So I explored the efficiency of .asc file compression for what seemed to be some of the more popular approaches. As my concern is the archiving of open data, I did not consider factors such as compression speed or encryption; all I was concerned about was the compression ratio.

My results would suggest that for an .asc file, the .7z compression format provides the best compression ratio. Use of the .7z format also has other advantages for archiving open data. The .7z format is openly documented, which means that it is supported by a wide range of compression software, and the freely available, open-source 7-Zip software (along with ports such as p7zip) runs on Windows, macOS, and Linux. This means that potential users of the data should always be able to easily decompress the data once they have received it.
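For anyone wanting to run a similar comparison on their own files, here is a rough sketch using only Python's standard library. It is not the test behind the results above: the raster is randomly generated stand-in data, the function names are invented for this example, and Python's lzma module writes .xz streams rather than .7z archives (LZMA is the family of algorithms that 7-Zip uses by default, so it serves as an approximation here). Ratios on real rasters will differ.

```python
import bz2
import lzma
import random
import zlib


def make_asc_text(ncols=200, nrows=200, seed=0):
    """Build a synthetic Esri ASCII raster as text (stand-in data only)."""
    rng = random.Random(seed)
    header = (
        f"ncols {ncols}\nnrows {nrows}\n"
        "xllcorner 0.0\nyllcorner 0.0\ncellsize 10.0\nNODATA_value -9999\n"
    )
    lines = [
        " ".join(f"{rng.uniform(0, 500):.2f}" for _ in range(ncols))
        for _ in range(nrows)
    ]
    return header + "\n".join(lines) + "\n"


def compression_ratios(data):
    """Return original size divided by compressed size for each compressor."""
    compressors = {
        "zlib": lambda b: zlib.compress(b, 9),         # DEFLATE, as used by .zip and .gz
        "bz2": lambda b: bz2.compress(b, 9),           # bzip2
        "lzma": lambda b: lzma.compress(b, preset=9),  # LZMA2 in an .xz container
    }
    return {name: len(data) / len(fn(data)) for name, fn in compressors.items()}


data = make_asc_text().encode("ascii")
ratios = compression_ratios(data)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ratio:.2f}")
```

To test a real raster instead, the data variable could be filled by reading the .asc file in binary mode.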
