In modern applications, there is an increasing need to use spatial data. Just think of the services we use on a daily basis such as Google Maps to search for places of interest and get directions, or hotel reservation sites that show us their location on a map. For this type of data, it is necessary to use a spatial database.
There are several spatial databases of which the most famous is certainly PostGIS. But what makes a normal database a spatial database? Essentially, spatial databases store and manipulate spatial objects like any other database object.
Beware!!! It must be borne in mind that a spatial database management system can be used in applications other than geography. Spatial databases are used, for example, to manage data on the anatomy of the human body, large-scale integrated circuits, molecular structures and electromagnetic fields.
In this article, we will briefly outline the evolution of spatial databases, and then review the three aspects that associate spatial data with a database:
- Spatial data types refer to shapes such as point, line and polygon;
- Multidimensional spatial indexing is used to efficiently process spatial operations;
- Spatial functions, defined in SQL, useful for querying spatial properties and relationships.
The combination of spatial data types, indices and functions provides a flexible structure to optimise performance and analysis.
Brief history
In first-generation GIS implementations, all spatial data were stored in files that had to be interpreted and manipulated using special GIS software. These first generation management systems, still used today as QGIS, were designed to meet the needs of users where all required data are within the user’s organisational domain. These systems are usually proprietary and autonomous, built specifically to manage spatial data.
Second-generation spatial systems, on the other hand, store some data in relational databases (usually the ‘attributes’ or non-spatial parts), but lack the flexibility offered by direct integration.
True spatial databases came into being when people started treating spatial characteristics as database objects. Spatial databases fully integrate spatial data with a relational database. The orientation of the system changes from GIS-centric to database-centric.
Basic aspects of spatial databases
In the following, we analyse the three fundamental aspects that characterise a spatial database.
Spatial data types
A normal database handles various data types including strings, numbers and dates. A spatial database adds other (spatial) types to represent geographical features. These spatial data types abstract and encapsulate spatial structures such as boundaries and dimensions. In many respects, spatial data types can be understood simply as shapes. Spatial data types are organised in a hierarchy of types. Each sub-type inherits the structure (attributes) and behaviour (methods or functions) of its superior type. The hierarchy of basic spatial data types is shown below.
Spatial indices and bounding boxes
A normal database provides indexes to allow quick and random access to subsets of data. Indexing for standard data types (e.g. numbers, strings, dates) is usually done with B-tree indexes.
A B-tree subdivides the data using the natural sort order to place the data in a hierarchical tree. The natural order of numbers, strings and dates is simple to determine: each value is less than, greater than or equal to every other value.
But since polygons can overlap, can be contained within each other and are arranged in two-dimensional space (or more), it is not possible to use a B-tree to index them efficiently. Spatial databases provide a ‘spatial index’ that instead answers the question “which objects are within this particular bounding box (also called a bounding box)?”.
A bounding box is the smallest rectangle – parallel to the coordinate axes – capable of containing a given feature. But why are bounding boxes used? If we want to answer the question whether a polygon of arbitrary shape is contained within another polygon, also of arbitrary shape, the computation can be very cumbersome. In contrast, using rectangles greatly simplifies the computational cost. Even the most complex polygons and lines can be represented by a simple bounding box.
To be useful, indices must be fast. Therefore, instead of providing exact results, as B-trees do, spatial indices provide approximate results. The question “which lines are within this polygon?” will instead be interpreted by a spatial index as “which lines have bounding boxes that are contained within the bounding boxes of this polygon?”.
The spatial indices actually implemented by various databases vary widely. The most common implementations are R-Tree and Quadtree (used in PostGIS), but there are also grid-based indices and GeoHash implemented in other spatial databases.
Spatial functions
To manipulate data during a query, a normal database provides functions such as concatenating strings, performing hash operations on strings, performing mathematical operations on numbers and extracting information from dates.
A spatial database provides, in addition to traditional database functions, a complete set of functions for analysing geometric components, determining spatial relationships and manipulating geometries. These spatial functions form the basis of any spatial design.
Most spatial functions can be grouped into one of the following five categories:
- Conversion: functions that convert geometries to external data formats
- Management: functions that manage space table information and its administration.
- Recovery: functions that recover the properties and measurements of a geometry
- Comparison: functions that compare two geometries with respect to their spatial relationship
- Generation: functions that generate new geometries from others
The list of possible functions is extensive, but a common set of functions is defined by the OGC (Open Geospatial Consortium) via the SFSQL (Simple Features for SQL) standard.
PostGIS
PostGIS transforms the PostgreSQL database management system into a spatial database, adding support for the three features we discussed earlier: spatial types, spatial indexes and spatial functions.
PostgreSQL
PostgreSQL is a powerful relational database management system (RDBMS). It is released under a BSD-type licence and is therefore free and open source software. Like many other open source programmes, PostgreSQL is not controlled by a single company, but has a global community of developers and companies developing it.
PostgreSQL was designed from the beginning with type extension in mind, i.e. the ability to add new data types, functions and indexes at runtime. For this reason, the PostGIS extension can be developed by a separate development team, while integrating seamlessly into the core PostgreSQL database.
Why choose PostgreSQL?
A frequent question from those familiar with open source databases is: “Why wasn’t PostGIS built on MySQL?”.
The main reasons are due to the fact that PostgreSQL has:
- Proven reliability and transactional integrity by default (ACID)
- Careful support of SQL standards (full SQL92)
- Extension of types and functions
- Community-oriented development model
- No limit on column size (“TOAST” tuples) to support large GIS objects
- Generic index structure (GiST) to enable R-Tree indexing
- Easy addition of customised functions
Finally, PostgreSQL offers a very simple development path for adding new spatial types.
Because the development path for adding types to PostgreSQL was so simple, developers chose this database to handle spatial data. When MySQL released the basic spatial types in version 4.1, the PostGIS team took a look at their code and their analysis reinforced the original decision to use PostgreSQL.
Since the MySQL spatial objects had to be modified according to the string type as a special case, the MySQL code was distributed over the entire code base. The development of PostGIS 0.1 took less than a month. Building a system for MySQL would have taken much longer.
Why not files?
The Shapefile (and other formats such as Esri’s Geodatabase File and GeoPackage) is, even today, a standard way of storing and interacting with spatial data since the birth of GIS software. However, these ‘flat’ files have the following disadvantages:
- Files require special software for reading and writing. SQL is an abstraction for random data access and analysis. Without this abstraction, one must write all the access and analysis code oneself.
- Competing users can cause corruption and slowdowns. Although you can write additional code to ensure that multiple writes to the same file do not corrupt the data, when you have solved the problem and also the associated performance issue, you will have written the best part of a database system. Why not use a standard database?
- Complicated questions require complicated software to answer them. Complicated and interesting questions (spatial joins, aggregations, etc.) that are expressible in one line of SQL require hundreds of lines of specialised code to answer when programming with files.
Most PostGIS users create systems where multiple applications will need to access data, so having a standard SQL access method simplifies deployment and development. Some users work with large data sets; with files, these can be segmented into several files, but in a database they can be stored as one large table.
In summary, the combination of support for multiple users, complex ad hoc queries and performance on large data sets is what distinguishes spatial databases from file-based systems.
Brief history of PostGIS
In May 2001, Refractions Research released the first version of PostGIS. PostGIS 0.1 had objects, indexes and a handful of functions. The result was a database suitable for storage and retrieval, but not for analysis.
As the number of functions increased, the need for an organising principle became apparent. The Open Geospatial Consortium’s ‘Simple Features for SQL’ (SFSQL) specification provided such a structure with guidelines for naming functions and requirements.
With PostGIS support for simple spatial analysis and joins, Mapserver became the first external application to provide data visualisation in the database.
In the following years, the number of functions in PostGIS grew, but its power remained limited. Many of the more interesting functions (e.g. ST_Intersects(), ST_Buffer(), ST_Union()) were very difficult to code. Writing them from scratch meant years of work.
Fortunately, a second project arrived, the ‘Geometry Engine, Open Source’ or GEOS. The GEOS library provides the algorithms needed to implement the SFSQL specification. By linking to GEOS, PostGIS has provided full support for SFSQL since version 0.8.
As the data capacity of PostGIS grew, another problem emerged: the representation used to store the geometry proved to be relatively inefficient. For small objects such as points and short lines, the metadata of the representation had an overhead of 300 per cent. For performance reasons, the representation had to be slimmed down. By reducing the metadata header and the required size, the overhead was reduced significantly. With PostGIS 1.0, this new, faster and lighter representation became the default.
Recent versions of PostGIS continue to add features and performance improvements, as well as supporting the new capabilities of the core PostgreSQL system.
Who is using PostGIS?
There are several organisations and companies that use PostGIS. If you would like to have a complete list of case studies, you can consult the dedicated page of PostGIS. Below are a couple of the most significant cases.
Institut Geographique National, France
IGN is the French national mapping agency and uses PostGIS to archive the country’s high-resolution topographic map, ‘BDUni’. BDUni has more than 100 million features and is maintained by a staff of more than 100 people who check observations and add new maps to the database every day. The IGN installation uses the database’s transactional system to ensure consistency during update processes and a warm standby system to maintain a period of activity in case of system failure.
RedFin
RedFin is a real estate agency with a web-based service for property scanning and value estimation. Their system was originally built on MySQL, but they found that switching to PostgreSQL and PostGIS provided huge advantages in terms of performance and reliability.
PostGIS has become a widely used spatial database and the number of third-party programmes supporting data storage and retrieval has increased. The programmes that support PostGIS include both open source and proprietary software on both server and desktop systems. The following figure shows a list of some of the software that take advantage of PostGIS.