The Shapefile 2.0 Manifesto
Saturday April 18 2009 at 07:50
Over at Alex Willmer’s Misspelled nemesis club there has been a (now-closed) discussion on the successor to the ubiquitous shapefile.
Alex points out a lot of the benefits of the shapefile (and there certainly have been many). Before we discuss its “successor”, we should reflect on our business needs and the problems we had with the original 1.0 version so that, having learned and remembered, we might create something better.
In taking my use of shapefiles over the years as an articulation of “business needs”, I see two, specific and distinct, areas of need:
- Data exchange;
- Data storage for read only data.
I have never used shapefiles for data editing of business data in a shared edit environment. Anyone who did so, or does so, (sorry) is mad.
I have many problems with shapefiles for data exchange.
- I have been emailed just the SHP file (in a fire fighting situation);
- The truncated attribute names made reconstruction of the data a pain;
- PRJs just don’t seem to be uniformly associated with shapefiles;
- The outer and inner rings of polygonal data are an inconsistent mess of orientations.
- The data needs some software that understands shapefiles to process – if the data is textual, or XML-based, it is amazing what tools one can bring to access data from within the document. (Though I have opened a DBF file in a text editor and extracted data!)
I’m sure there are more reasons but these are a pretty reasonable summary of the issues for exchange as I see it.
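The ring-orientation complaint is the one item in that list that can be checked mechanically. The published shapefile specification expects outer rings to run clockwise and holes counter-clockwise. What follows is only a minimal sketch of such a check using the shoelace formula; the function names are my own for illustration, and it assumes each ring is a plain Python list of (x, y) tuples with the y axis pointing up.

```python
def signed_area(ring):
    """Shoelace formula over a ring given as a list of (x, y) tuples.

    Positive for counter-clockwise rings, negative for clockwise ones.
    A repeated closing vertex contributes nothing, so closed and
    unclosed rings give the same answer.
    """
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def is_clockwise(ring):
    """True when the ring winds clockwise (zero-area rings report False)."""
    return signed_area(ring) < 0


def orient(ring, clockwise):
    """Return the ring in the requested winding order, reversing it if needed."""
    return ring if is_clockwise(ring) == clockwise else ring[::-1]


# Hypothetical data: a unit square digitised counter-clockwise.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
outer = orient(square, clockwise=True)   # shapefile convention for outer rings
hole = orient(square, clockwise=False)   # shapefile convention for holes
```

A receiving system could run something like this over every polygon on load rather than trusting the sender, which is more or less what one ends up doing anyway.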
Any business, agency or government department worth its geo-salt has large amounts of read-only data sourced from outside. It is an important element of the GIS Manager/Administrator’s arsenal to have a high-speed geodata-storage format on hand to present this data to the GIS and IMS software. For this I have used shapefiles on many occasions.
But, you know, a big pain in this arena was that, once the data was stored in a shapefile, I could not create the SBN/SBX spatial index files unless I had ESRI software (and only ESRI software could read them anyway)!
Anyone else note the lack of “donation” of the structure of the spatial index to the public domain in the shapefile specification?
This lack of donation of a critical element of a format highlights a big limitation in a vendor donating any format (or API) to the public domain (as in some sort of “look at me” act of philanthropy).
I also have issues with the shapefile as a format in other areas as I have articulated in presentations I have given on databases vs file formats. Here is a summary on shapefiles from a recent presentation:
- Rigid format that is locked to particular application/use.
- Change in file format requires all programs that access it to be modified and recompiled.
- No accessible metadata
- Attribute name restrictions (this could not be fixed as the DBF “standard” is not modifiable via a transparent public process – that I know of);
- Data consistency
- Attributes in shapefile DBF are not self-checking via integrity rules that are application independent.
- What is the definition of a correct shape?
- Bad shapes in shapefiles are very common!
- No security.
- Multi-user access limited to read only at best (files are often locked to specific applications).
- Cannot update a shapefile that is being read by another application!
- Access limited to proprietary systems that understand structure.
- Not all the shapefile format is published eg spatial index files. Why?
- File format created to satisfy particular functions.
- New requirements need new programs to create/access.
- End user not empowered to modify structure eg topological shapefile!
- Lack of alignment with industry standards (Support for OGC curves)
- Size limitations: 32-bit programming limits or assumptions often stop the file size, or the links between files, from exceeding 2 GB.
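To make the attribute name restriction in that list concrete: a DBF field name can hold at most 10 characters, so longer column names have to be shortened on export. The sketch below assumes the crudest possible exporter, one that simply cuts each name at 10 characters (real software often appends a suffix instead), and the column names are made up for illustration.

```python
from collections import defaultdict

DBF_NAME_LIMIT = 10  # a DBF field name holds at most 10 characters


def truncation_collisions(names, limit=DBF_NAME_LIMIT):
    """Map each truncated name to the original names that collapse onto it.

    Only truncated names shared by two or more originals are returned.
    Naive truncation is an assumption; actual exporters vary.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name[:limit]].append(name)
    return {short: full for short, full in groups.items() if len(full) > 1}


# Hypothetical source columns from a database table.
columns = ["population_2001", "population_2006", "road_name"]
clashes = truncation_collisions(columns)
print(clashes)
```

Here both census columns collapse to the same ten characters, which is exactly the sort of thing the recipient has to untangle by hand.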
If I reflect on Alex’s three suggestions, then perhaps I might have my own answer to the question.
- SpatiaLite;
- File Geodatabase;
- Spatial Data Format (SDF);
First off, if you are transactionally editing anything in your business/organisation then you use a database. And I don’t care what database you use: the choice of database is never about which is “best” but about which is “incumbent” within an organisation.
Would I use SpatiaLite as a replacement for a shapefile?
SQLite is a database and SpatiaLite is a spatial extension for that database. I am not sure that I would build a data exchange and read-only access system based on a database, because, perhaps, it is too complex. But, as the SpatiaLite web page points out:
”[...] each SQLite database is simply a file; you can freely copy it, compress it, send it on a LAN or WEB with no complication at all”.
Added to this bit of excellence is the fact that databases are meant to be “self-referential”, with all their data rules encoded within them (ie primary and foreign keys, and check, unique and table constraints). As such I would prefer to get a fully specified single-file SQLite database (with spatial data and indexing via SpatiaLite) than any geospatial vendor’s proprietary data file format.
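As a rough illustration of what data rules travelling with the file might look like, here is a small sketch using Python's built-in sqlite3 module. The table and column names are invented for the example, and the geometry is held as plain WKT text only because the sketch does not load the spatial extension; a real exchange file would use proper geometry columns and a spatial index.

```python
import sqlite3


def build_exchange_db(path=":memory:"):
    """Create a single-file database whose integrity rules live inside it.

    The schema is hypothetical. Geometry is stored as WKT text here as a
    stand-in for a real spatial column.
    """
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
    conn.executescript("""
        CREATE TABLE road_class (
            code        TEXT PRIMARY KEY,
            description TEXT NOT NULL
        );
        CREATE TABLE road (
            road_id            INTEGER PRIMARY KEY,
            name_of_the_road   TEXT NOT NULL UNIQUE,   -- no 10 character limit
            lanes              INTEGER NOT NULL CHECK (lanes BETWEEN 1 AND 12),
            class_code         TEXT NOT NULL REFERENCES road_class(code),
            geom_wkt           TEXT NOT NULL
        );
        INSERT INTO road_class VALUES ('A', 'Arterial');
    """)
    return conn


def try_insert(conn, row):
    """Insert one road row; return False if any constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO road (name_of_the_road, lanes, class_code, geom_wkt) "
                "VALUES (?, ?, ?, ?)",
                row,
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = build_exchange_db()
good = try_insert(conn, ("High Street", 2, "A", "LINESTRING(0 0, 1 1)"))
bad_lanes = try_insert(conn, ("Low Street", 0, "A", "LINESTRING(0 0, 2 2)"))
bad_class = try_insert(conn, ("Mid Street", 2, "Z", "LINESTRING(0 0, 3 3)"))
```

The point is that the rejections come from the file itself, not from whichever application happens to be writing to it. One caveat: foreign key enforcement in SQLite is a per-connection setting, so a reader that wants it has to switch it on.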
One final thing I like about SpatiaLite is that its creator, Alessandro Furieri, has not attempted to write, from scratch, his own database management system. He has sensibly looked at the open source community and found the best for his (and our) purpose: SQLite. Well done. Because of this he doesn’t have to write ODBC etc drivers specifically for SQLite, just extend those that already exist (in the open source community). This bodes well for this style of data access for SQL clients.
Would I use ESRI’s “File Geodatabase”? Despite all that Scott Morehouse says in his comment, we have had many, many years of experience with shapefiles, and, reflecting on that experience, all I could say is that, unless ESRI donated the File Geodatabase format and all their non-ArcObjects APIs to the open source community via a suitable license (I like Creative Commons by Attribution but that is probably not suitable for something like this) which guaranteed that ESRI only had one voice in any decisions made to the format, then I would want to have nothing to do with any proprietary format given to the public domain by ESRI. Note that, in Scott’s comments, at no stage does he indicate that such an API or the File Geodatabase physical file format will be released to the public domain. This is a show-stopper for File Geodatabases as the new Shapefile 2.0.
Finally, wrt File Geodatabase, Scott talks a lot about its complexity and semantic richness. Sure, one needs this for data editing in a transactional environment (preferably via a properly constructed database that is aligned to computer science data management principles – and not a single vendor’s view on spatial data management), but who needs this for simple file exchange? Vale, File Geodatabases!
Would I use SDF? Certainly we can read/write the format via FDO (which gives Autodesk some boasting rights wrt open source), but an access API is, perhaps, only half the problem. How can we be sure that SDF doesn’t have a spatial index or some other limitation that only the Autodesk FDO SDF provider can use? From the Wikipedia article on SDF we can see that:
_“The SDF format design uses low-level storage components of SQLite using a flat binary serialization (binary large objects). [That] is a single-user Geodatabase file format developed by Autodesk. [...] The current format version is SDF3 (based on SQLite3), which is a single file.”_
So, SDF passes the “single file test” but, worryingly:
_”[...] the relational aspects are not present, thus the format cannot be opened with any software designed specifically for SQLite. [...] Beyond Autodesk’s products, the only product that can read/write the format is FME from Safe Software.”_
I can’t help thinking: what is being hidden from us?
Revisiting SpatiaLite we can see that it is not limited by the above restrictions, thus I think SpatiaLite is a better choice for our purpose than (a partially hidden and crippled) SDF.
For data exchange, I don’t have a problem with GML (though I see very few valid instance documents) per se, but the handling of attributes is a real problem. Why? If the data is from one agency and that agency has not created a DTD or XML Schema to cover the attribute data, OR, there is a schema but they fail to provide it to the agency to whom the data has been sent, problems arise. At least with DBF files there is no such question! Can GML documents be used for read-only high performance storage of external spatial data? Even if the data is compressed with a suitable standards-compliant compression mechanism, we do not have a standards-based spatial index on which to base the spatial indexing (thus, like shapefiles, we can end up with alternate spatial indexing systems).
In the end, does it really matter about an exchange data format? To be honest, I think the geospatial industry is too fixated on creating its own industry-specific binary formats. (Or its ability to obfuscate things by turning something simple into something complex.)
Can we learn anything from the database “design pattern” of “logical abstraction from physical implementation”? I think we can. But firstly let’s look at Feature Data Objects and Scott’s comments – _“At ESRI, we are working on a low-level (non ArcObjects-based) API for the file Geodatabase.”_. Sorry, Scott, but Autodesk has the drop on you here with regards to open source APIs for accessing spatial data via FDO. Why doesn’t ESRI write an FDO provider for their data formats? (AFAIK MapInfo has not done so as yet for its formats.)
Is this a “control” thing? Note, however, that if a user of your products zips up a file Geodatabase (have they gotten all the files?) and gives it to another business without ESRI software, what is wrong with that business using a free FDO provider to access the data (err, one that implements access to any spatial indexing!)? Or are vendors trying to get software sales via a locked-up data format (a geo-Trojan horse)?
At least FDO provides a logical platform for implementing that which we need and cleanly breaks the nasty hard connection between file formats and applications. While I like FDO I also dislike it. Personally, I would rather see vendors making SQL drivers (implementing SQL/MM and OGC SFS SQL access standards) available for any proprietary data formats they believe they need to create. Manifold GIS does this for its .map file format (though they do not make the driver available for download): perhaps they will, and show the world the way to do it.
But even with FDO we have a problem: it is, primarily, a geospatial programmers’ solution to data access. I cannot access any FDO spatial data format except through an application that has bound it in to its architecture. I would love to be able to do the following (on Windows) for non-database spatial data formats (for then I will not have to worry about any internal file storage issues):
I am sure that the jury is still out on a suitable format for the function being discussed by those who will make these decisions. Call me cynical, but I’ve been around vendors, standards and the industry too long to see any quick resolution to this issue.
However, this discussion and article have been useful because I will now give a long and serious look at the SpatiaLite/SQLite combination because, from this cursory examination, it seems to fulfill all my requirements for Shapefile 2.0.