Data Structures in GIS

There are five kinds of data to be represented in a GIS (see figure 1).


Point features, e.g. locations of soil samples, boreholes, manholes, rain gauges, burst water mains, pumping stations, trees, buildings. Point features consist of nodes with no thickness and are often referred to as zero dimensional. One method of storing point features in a GIS is as a table in the database management system:

Point ID    X coordinate    Y coordinate    Pointer to attribute data
1           30123.6         19782.4         Point 1
2           30167.3         19745.7         Point 2
3           34952.2         19648.1         Point 3

Here the pointer to attribute data is a link into another database table where other data about that point is kept, for example that it represents an access chamber to a sewer system and that it has properties such as date of construction, condition, size and material. Because this is a link into a full database management system, further relations are permitted from this point.
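The link between the point table and the attribute table can be sketched in a few lines of Python. This is only an illustration, not how any particular GIS stores its data: the dictionary layout, the function name `describe` and the attribute values for the access chamber are all assumptions made for the example.

```python
# Point table: point ID -> (x, y, pointer into the attribute table).
points = {
    1: (30123.6, 19782.4, "Point 1"),
    2: (30167.3, 19745.7, "Point 2"),
    3: (34952.2, 19648.1, "Point 3"),
}

# Attribute table keyed by the pointer. The values are invented for illustration.
attributes = {
    "Point 1": {
        "type": "access chamber",
        "built": 1975,
        "condition": "good",
        "material": "brick",
    },
}


def describe(point_id):
    """Follow the pointer from a point record to its attribute record."""
    x, y, pointer = points[point_id]
    record = attributes.get(pointer, {})  # empty if no attributes are held
    return {"x": x, "y": y, **record}


print(describe(1))
```

In a real system the second table would live in the database management system, so the attribute record could itself point to further tables.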

Linear features, e.g. roads (on small scale maps), rivers, pipelines, power lines, elevation contours. The nodes are linked by arcs, each with a number of vertices (the simplest arc is a straight line). Between vertices the arc is usually considered a straight line, but curved links are possible. Line data can be non-branching lines, trees or network structures. In a network there can be more than one route between two nodes. This data has one dimension, that is, it has no thickness, and care must be taken in the definition of the system that a loop is not confused with a polygon. A simple structure for a line feature or network is:

Line reference
Attribute pointer
Arc 1
Arc 2
---
etc.

Arc reference    X coordinate    Y coordinate
Node 1           30123.6         19782.4
Node 2           30167.3         19745.7
Vertex 3         30952.2         19648.1
etc.
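As a rough sketch of the line structure above, a line can be held as a reference, an attribute pointer and a list of arcs, with each arc holding its coordinates in order. The class and field names are assumptions for illustration, as are the coordinates, and straight segments between vertices are assumed.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Arc:
    # Ordered coordinates: the first and last are nodes, the rest are vertices.
    coords: list

    def length(self):
        """Sum of the straight-line segments between successive coordinates."""
        return sum(
            math.hypot(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(self.coords, self.coords[1:])
        )


@dataclass
class Line:
    reference: str
    attribute_pointer: str
    arcs: list = field(default_factory=list)

    def length(self):
        """Total length of all arcs that make up the line."""
        return sum(arc.length() for arc in self.arcs)


# Illustrative coordinates only.
pipe = Line("line 1", "Pipe 1", [
    Arc([(0.0, 0.0), (3.0, 4.0), (3.0, 10.0)]),
    Arc([(3.0, 10.0), (11.0, 16.0)]),
])
print(pipe.length())
```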

Areas (polygons) with common properties, e.g. pressure zones, catchments, contributing areas, soil association mapping units, climate zones, administrative district areas, buildings and other land cover. The polygon consists of a number of arcs or linear features that form a closed loop without crossing over one another. The arcs are usually straight between vertices but may be curved.
Polygon reference    Attribute pointer
3                    Point 1

X coordinate    Y coordinate
30123.6         19782.4
30167.3         19745.7
30952.2         19648.1
30123.6         19782.4

Simple polygon structure.

The simple polygon representation shown above, where a quadrangle is represented, as used in CAD (or DXF format), is of little use in GIS. The three major problems with simple polygons are:

  1. The boundary between two polygons needs to be stored twice. There is always a possibility that the nodes of the shared boundary are in slightly different positions in each polygon, resulting in artificial gaps between polygons, or slivers where an area is assigned to two or more polygons (see figure 3). These problems are particularly acute when manually digitising.

  2. When manually digitising it is possible to cross over accidentally from one polygon to another, creating a totally false polygon, or to pass from one node to another in the incorrect order, giving rise to a weird polygon (see figure 4).

  3. Complex geographical objects are difficult to represent, for example islands or disjointed polygons (see figure 5). Consider an example from urban drainage where a garden area is completely surrounded by car park, as at a prestige office complex. If we calculate the area of the simple polygons, the area of the car park will erroneously include the area of the gardens and grossly overestimate the impermeable area.
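The island problem can be illustrated with a short calculation. The sketch below computes ring areas with the standard shoelace formula and subtracts the island from its surrounding polygon. The function names and the car park and garden coordinates are invented for the example, and rings are assumed to be closed (first coordinate repeated at the end) and not self-crossing.

```python
def ring_area(ring):
    """Unsigned area of a closed ring using the shoelace formula."""
    twice_area = sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(ring, ring[1:])
    )
    return abs(twice_area) / 2.0


def polygon_area(outer, islands=()):
    """Area of the outer ring minus the area of any islands inside it."""
    return ring_area(outer) - sum(ring_area(island) for island in islands)


# Hypothetical car park (100 m x 60 m) with a 20 m x 20 m garden inside it.
car_park = [(0, 0), (100, 0), (100, 60), (0, 60), (0, 0)]
garden = [(40, 20), (60, 20), (60, 40), (40, 40), (40, 20)]

print(ring_area(car_park))               # simple polygon: garden wrongly included
print(polygon_area(car_park, [garden]))  # island removed
```

With these figures the simple polygon gives 6000 square metres of impermeable area, whereas removing the garden gives 5600.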

In GIS, therefore, area data is represented as a topological structure in one of a number of ways. The Arc/Info method of storing this information is shown in figure 7. A separate list is used to hold information about islands and disjointed structures. Different themes can be represented on the same coverage and there is no requirement that polygons do not overlap. For example, a single coverage may contain polygons representing land cover, whereas other polygons may represent the contributing areas to inlet nodes of a storm water drainage system (see figure 6). The polygons naturally overlap, and the intersection of these polygons provides one of the main uses of GIS. It is known as overlay, reflecting the graphical process of overlaying one theme upon another.

Actual or potential surfaces, e.g. ground elevation, variation of mean annual temperature, spatial distributions of rainfall, population densities. These are discussed in detail in the section on the digital elevation model (DEM).

Temporal elements, e.g. changes in land use over time, changes to a pipe network, rainfall records or streamflow records. These are not well represented in current GIS technology, but newer object oriented GIS should make this more readily available.

Raster Representation

Figure 6 shows two polygons intersecting. The numerical calculation required to find either the intersection or the join of the two polygons is quite intensive. The whole process is made much simpler if the polygons are all the same shape and size, preferably rectangular. This use of rectangular polygons is known as a cell, grid or raster representation and provides one of the simplest representations for GIS and spatial statistical modelling. Figure 7 shows the same polygon data represented as a vector and as a raster. Note that the individual cell values can be either numbers for computation, such as elevations, or pointers to a database with further attributes.
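To show why the raster form makes overlay simple, the sketch below intersects and joins two small rasters cell by cell. The grids, the 10m cell size and the function name `overlay` are assumptions for illustration; both rasters are assumed to share the same origin and cell size, and a cell value of 1 means the cell belongs to the polygon.

```python
def overlay(a, b, op):
    """Combine two equally sized rasters cell by cell with the function op."""
    return [
        [op(x, y) for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


# Two hypothetical 3 x 3 rasters.
land_cover = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
]
contributing_area = [
    [0, 1, 1],
    [0, 1, 1],
    [0, 0, 0],
]

intersection = overlay(land_cover, contributing_area, lambda x, y: x & y)
join = overlay(land_cover, contributing_area, lambda x, y: x | y)

cell_area = 10 * 10  # square metres, assuming 10m cells
intersection_area = sum(map(sum, intersection)) * cell_area
print(intersection, intersection_area)
```

The same loop works for any cell-wise operation, which is much of the reason raster modelling is easy to program.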

The ease of programming raster GIS systems and their low computational overheads make them very suitable for natural or environmental modelling. The size of cells used in GIS modelling requires careful thought before data entry and modelling can begin. I have used cells of 1m square for urban drainage work where we were only interested in a small catchment and 250m square for land evaluation where we were studying the whole of Ghana.

There is always error in the representation of real world structures as small cells, and it is important to realise the trade-off between small cells, which accurately represent the real world but carry a lot of computational overhead, and large cells, which are much more efficient but introduce large errors. Fortunately computers are getting more powerful and disk drives much larger every year, so these problems become less important and we can select cell sizes to represent the natural variation we observe. For example, a soil association boundary will never be known on the ground to better than 50m accuracy, so using any cell size less than 50m is pointless. My recommendations on cell size are as follows:
Data source or application                     Cell size
Data derived from 1:50 000 maps                50m
Data derived from 1:10 000 maps                10m
Data derived from 1:1250 maps                  1m
Any modelling with satellite remote sensing    resolution of the sensor (often 30m)
Nation wide land evaluation                    250m
Studies involving geodemographics              200m
Physically based rainfall runoff modelling     20-40m (it is debatable whether it is truly physically based at this resolution but this will allow realistic computation times)
Flood plain studies                            50m
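The trade-off between cell size and computational overhead can be made concrete with some simple arithmetic. The sketch below counts the cells needed to cover a hypothetical 10km by 10km study area at several of the cell sizes listed above. The study area and the function name `cell_count` are assumptions for illustration.

```python
import math


def cell_count(width_m, height_m, cell_m):
    """Number of square cells of side cell_m needed to cover the area."""
    return math.ceil(width_m / cell_m) * math.ceil(height_m / cell_m)


# Hypothetical 10km x 10km study area.
for cell in (250, 50, 10, 1):
    print(f"{cell}m cells: {cell_count(10_000, 10_000, cell):,}")
```

Halving the cell size quadruples the number of cells, so moving from 50m to 1m cells on this area raises the count from 40,000 to 100,000,000 for each raster layer.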

Most GIS that use raster data have some means of compressing the data, using run length encoding, quad trees or any of the lossless schemes used for computer graphics. Unless you intend to write your own modules (and one of the big attractions of raster GIS is that you can write your own modules), the compression technique is irrelevant to the user. However, it does mean that raster GIS databases can be as small as their vector counterparts.
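Run length encoding is the simplest of these schemes to illustrate. The sketch below stores each raster row as (value, run length) pairs and restores it exactly. It is a minimal illustration rather than the format used by any particular GIS, and the function names and sample row are assumptions.

```python
def rle_encode(row):
    """Collapse a row of cell values into (value, run length) pairs."""
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [tuple(run) for run in runs]


def rle_decode(runs):
    """Expand (value, run length) pairs back into the original row."""
    return [value for value, count in runs for _ in range(count)]


row = [0, 0, 0, 5, 5, 5, 5, 5, 0, 0]
encoded = rle_encode(row)
print(encoded)
assert rle_decode(encoded) == row  # the scheme is lossless
```

The saving is greatest where large areas share a value, which is typical of thematic rasters such as land cover.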

With some raster GIS all overlays must be carried out with identically sized cells and all resampling must be carried out manually before the overlay modelling begins. With other GIS the resampling is carried out dynamically to either the largest grid size of all the overlays in the model or some user specified grid size.
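Resampling to a common cell size can be sketched with a nearest neighbour rule, in which each output cell takes the value of the input cell containing its centre. This is one simple approach among several and is not necessarily what any given GIS does. The function name and grids are assumptions, and both grids are assumed to share the same origin.

```python
def resample_nearest(grid, src_cell, dst_cell):
    """Resample a raster from cell size src_cell to dst_cell (same units)."""
    rows, cols = len(grid), len(grid[0])
    out_rows = int(rows * src_cell // dst_cell)
    out_cols = int(cols * src_cell // dst_cell)
    out = []
    for r in range(out_rows):
        # Row of the source cell that contains the centre of output row r.
        src_r = min(int((r + 0.5) * dst_cell // src_cell), rows - 1)
        out_row = []
        for c in range(out_cols):
            src_c = min(int((c + 0.5) * dst_cell // src_cell), cols - 1)
            out_row.append(grid[src_r][src_c])
        out.append(out_row)
    return out


fine = [
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [3, 3, 4, 4],
    [3, 3, 4, 4],
]
coarse = resample_nearest(fine, 10, 20)  # 10m cells resampled to 20m cells
print(coarse)
```

Resampling to the larger cell size discards detail, so it is usually worth checking that the coarser grid still represents the variation of interest.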

Raster and vector GIS are traditionally compared and the author states his preference for one or the other, but most modern GIS have vector and raster components which can often be interlinked seamlessly. Some tasks are easier to carry out in one form than in the other. For example, cadastral work requires the accuracy and precision of a vector GIS, whereas determining the water requirements of a region is best done using a raster representation.

GIS software comes in a variety of packages. The two main types, as already described, are the vector based system and the raster based system. More modern systems permit the total integration of raster and vector data, allowing the advantages of both methods to be enjoyed, with few of the disadvantages.

Vector systems are often supported by traditional DataBase Management Systems (DBMS). The most common conform to the relational model, see Avison (1992). Arc-Info, the most widely used vector GIS package, follows this approach, Info being a relational DBMS in its own right. The relational model is the basis of most DBMS used in organisations and businesses. This underlies the vector model's principal use as an asset or resource inventory system. A DBMS should allow different types of user access to appropriate parts of the database, and prevent unauthorised viewing or changing. It should also maintain data concurrency, provide archive facilities and present a simple interface to the user for manipulating the data.

Raster systems generally do not employ such strict data management. They have developed from image processing systems and are often used by a single user. Clearly these are generalisations, and many packages will embody aspects of both systems.

The most up-to-date systems are described as 'object oriented'. The distinction of object oriented systems is that all data items are described as being of one or more object types, e.g. a linear feature, a point, a vector polygon, a regular raster, a raster cell, a TIN, a DEM, etc. In addition to storing the description of the object, the methods of displaying, plotting and general manipulation are also carried with the object type. This is known as encapsulation.

Objects are hierarchical. Rivers, roads and pipes will be objects that are descended from the linear object, so each will have the properties, behaviour and methods inherited from the linear feature, such as length. However, they will each have behaviour and properties that are distinct: roads will have classes (e.g. 'A' roads and motorways), and pipes and roads will not be able to connect to form a network.
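The hierarchy described above maps naturally onto classes in an object oriented language. The sketch below is an illustration only: the class names, the `network` attribute and the `can_connect` rule are assumptions chosen to mirror the text, not the design of any actual object oriented GIS.

```python
import math


class LinearFeature:
    """Parent object: holds coordinates and the methods common to all lines."""

    network = "generic"

    def __init__(self, coords):
        self.coords = coords

    def length(self):
        # Inherited unchanged by every descendant.
        return sum(
            math.hypot(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(self.coords, self.coords[1:])
        )

    def can_connect(self, other):
        # Features may only join a network of their own kind.
        return self.network == other.network


class Road(LinearFeature):
    network = "road"

    def __init__(self, coords, road_class):
        super().__init__(coords)
        self.road_class = road_class  # e.g. "A" or "motorway"


class Pipe(LinearFeature):
    network = "pipe"


road = Road([(0, 0), (3, 4)], "A")
pipe = Pipe([(3, 4), (3, 10)])
print(road.length(), pipe.length(), road.can_connect(pipe))
```

Because the connection rule is carried with the object type, an attempt to join a pipe to a road can be rejected when the data is entered rather than discovered later as an inconsistency.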

The object oriented paradigm is currently of great interest to the computer science community. Object oriented programming languages, databases and, of course, GIS are under development (see Worboys et al, 1990). There are several advantages that are stressed by advocates of the object oriented approach:

(i) it is intuitive as people naturally think in terms of objects;

(ii) by specifying behaviour, inconsistencies in the database can be reduced, for example sewers and water mains objects exhibit different behaviour and should not be part of the same network;

(iii) developing applications is easy; by having a hierarchical structure new objects are easily created.

There are a variety of ways of storing geographical data and different ways of processing the data. The choice of data structure is largely dictated by the use to which the data is to be put, the capabilities of the GIS being used and, to a large extent, the existing data formats.


Figure 7. (a) Simple vector representation, using the topological model presented by Dangermond (1982); more complex structures are used to improve access times. (b) Raster representation; a raster layer is required for each attribute to be represented.

References

Avison DE (1992), Information Systems Development A Database Approach, 2nd Edition, Blackwell Scientific Publications.

Bradbury PA, Lea NJ and Bolton P (1993), Estimating Catchment Yield: Development of the GIS-based Calsite Model, Report OD125, April 1993, HR Wallingford.

Burrough PA (1986), Principles of Geographical Information Systems for Land Resources Assessment, Clarendon Press, Oxford.

Carter (1989), On Defining the Geographic Information Systems, Fundamentals of Geographic Information Systems: A compendium, edited by Ripple WJ, pp3-6.

Dangermond J (1982), A Classification of Software Components Used in Geographic Information Systems, Proc. US - Australia Workshop on the Design and Implementation of Computer Based Geographic Information Systems, Honolulu Hawaii, pp70-91.

Elgy J, Maksimovic C and Prodanovic D (1993), Using Geographical Information Systems for Urban Hydrology, International Conference on Application of Geographical Information Systems in Hydrology and Water Resources, Vienna, Austria.

Lillesand TM and Kiefer RW (1987), Remote Sensing and Image Interpretation, 2nd Edition, Wiley.

Sibson R (1978), Locally Equiangular Triangulation, The Computer Journal, v21 n3, pp243-245.

Siyyid AN (1993), The use of METEOSAT data for rainfall/runoff modelling, PhD Thesis, Aston University, May 1993.

Worboys MF, Hearnshaw HM and Maguire DJ (1990), Object-Oriented Data Modelling for Spatial Databases, International Journal of Geographical Information Systems, v4 n4, pp369-383.