Shapefile
The shapefile format is a geospatial vector data format for geographic information system software. It is developed and regulated by Esri as a mostly open specification for data interoperability among Esri and other GIS software products. The shapefile format can spatially describe vector features: points, lines, and polygons, representing, for example, water wells, rivers, and lakes. Each item usually has attributes that describe it, such as name or temperature.
Overview
The shapefile format is a digital vector storage format for storing geometric location and associated attribute information. This format lacks the capacity to store topological information. The shapefile format was introduced with ArcView GIS version 2 in the early 1990s. It is now possible to read and write geographical datasets using the shapefile format with a wide variety of software.
The shapefile format stores the data as primitive geometric shapes like points, lines, and polygons. These shapes, together with data attributes that are linked to each shape, create the representation of the geographic data. The term "shapefile" is quite common, but the format consists of a collection of files with a common filename prefix, stored in the same directory. The three mandatory files have filename extensions .shp, .shx, and .dbf. The actual shapefile relates specifically to the .shp file, but this file alone is incomplete for distribution, as the other supporting files are required. Legacy GIS software may expect the filename prefix to be limited to eight characters to conform to the DOS 8.3 filename convention, though modern software applications accept files with longer names.
Mandatory files
- .shp: shape format; the feature geometry itself
- .shx: shape index format; a positional index of the feature geometry to allow seeking forwards and backwards quickly
- .dbf: attribute format; columnar attributes for each shape, in dBase IV format
Other files
- .prj: projection description, using a well-known text representation of coordinate reference systems
- .sbn and .sbx: a spatial index of the features
- .fbn and .fbx: a spatial index of the features that are read-only
- .ain and .aih: an attribute index of the active fields in a table
- .ixs: a geocoding index for read-write datasets
- .mxs: a geocoding index for read-write datasets
- .atx: an attribute index for the .dbf file in the form of shapefile.columnname.atx
- .shp.xml: geospatial metadata in XML format, such as ISO 19115 or other XML schema
- .cpg: used to specify the code page for identifying the character encoding to be used
- .qix: an alternative quadtree spatial index used by MapServer and GDAL/OGR software
In each of the .shp, .shx, and .dbf files, the shapes correspond to each other in sequence: the first record in the .shp file corresponds to the first record in the .shx and .dbf files, and so on. The .shp and .shx files have various fields with different endianness, so an implementer of the file formats must be very careful to respect the endianness of each field and treat it properly.
Shapefile shape format (.shp)
The main file contains the geometry data. The binary file consists of a single fixed-length header followed by one or more variable-length records. Each of the variable-length records includes a record-header component and a record-contents component. A detailed description of the file format is given in the ESRI Shapefile Technical Description. This format should not be confused with the AutoCAD shape font source format, which shares the .shp extension.
The 2D axis ordering of coordinate data assumes a Cartesian coordinate system, using the order (X Y) or (easting northing). This axis order is consistent for geographic coordinate systems, where the order is similarly (longitude latitude). Geometries may also support 3- or 4-dimensional Z and M coordinates, for elevation and measure, respectively. A Z dimension stores the elevation of each coordinate in 3D space, which can be used for analysis or for visualisation of geometries using 3D computer graphics. The user-defined M dimension can be used for one of many functions, such as storing linear referencing measures or the relative time of a feature in 4D space.
The main file header is fixed at 100 bytes in length and contains 17 fields; nine 4-byte integer fields followed by eight 8-byte signed floating point fields:
Bytes | Type | Endianness | Usage |
0–3 | int32 | big | File code |
4–23 | int32 | big | Unused; five uint32 |
24–27 | int32 | big | File length |
28–31 | int32 | little | Version |
32–35 | int32 | little | Shape type |
36–67 | double | little | Minimum bounding rectangle of all shapes contained within the dataset; four doubles in the following order: min X, min Y, max X, max Y |
68–83 | double | little | Range of Z; two doubles in the following order: min Z, max Z |
84–99 | double | little | Range of M; two doubles in the following order: min M, max M |
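The mixed endianness of the header can be illustrated with a minimal Python sketch that decodes the 100 bytes according to the table above. The names ShpHeader and parse_shp_header are illustrative only and do not belong to any library. The sketch also assumes, following the ESRI Shapefile Technical Description rather than the table itself, that the file length is counted in 16-bit words.

```python
import struct
from typing import NamedTuple


class ShpHeader(NamedTuple):
    file_code: int
    file_length_bytes: int
    version: int
    shape_type: int
    bbox: tuple      # (min X, min Y, max X, max Y)
    z_range: tuple   # (min Z, max Z)
    m_range: tuple   # (min M, max M)


def parse_shp_header(buf: bytes) -> ShpHeader:
    """Decode the fixed 100-byte header of a .shp (or .shx) file."""
    if len(buf) < 100:
        raise ValueError("header must be at least 100 bytes")
    # Bytes 0-27: seven big-endian int32 values
    # (file code, five unused integers, file length).
    file_code, *_unused, length_words = struct.unpack(">7i", buf[:28])
    # Bytes 28-99: two little-endian int32 values (version, shape type)
    # followed by eight little-endian doubles (bounding box, Z and M ranges).
    version, shape_type, *d = struct.unpack("<2i8d", buf[28:100])
    # Assumption: the length field counts 16-bit words, so double it for bytes.
    return ShpHeader(file_code, length_words * 2, version, shape_type,
                     tuple(d[0:4]), tuple(d[4:6]), tuple(d[6:8]))
```

In practice the function would be given the first 100 bytes read from a file opened in binary mode. Using two separate unpack calls keeps each byte-order prefix next to the fields it applies to.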
The file then contains any number of variable-length records. Each record is prefixed with a record header of 8 bytes:
Bytes | Type | Endianness | Usage |
0–3 | int32 | big | Record number |
4–7 | int32 | big | Record length |
Following the record header is the actual record:
Bytes | Type | Endianness | Usage |
0–3 | int32 | little | Shape type |
4– | – | – | Shape content |
The variable-length record contents depend on the shape type, which must be either the shape type given in the file header or Null. The following are the possible shape types:
Value | Shape type | Fields |
0 | Null shape | - |
1 | Point | X, Y |
3 | Polyline | MBR, Number of parts, Number of points, Parts, Points |
5 | Polygon | MBR, Number of parts, Number of points, Parts, Points |
8 | MultiPoint | MBR, Number of points, Points |
11 | PointZ | X, Y, Z Optional: M |
13 | PolylineZ | Mandatory: MBR, Number of parts, Number of points, Parts, Points, Z range, Z array Optional: M range, M array |
15 | PolygonZ | Mandatory: MBR, Number of parts, Number of points, Parts, Points, Z range, Z array Optional: M range, M array |
18 | MultiPointZ | Mandatory: MBR, Number of points, Points, Z range, Z array Optional: M range, M array |
21 | PointM | X, Y, M |
23 | PolylineM | Mandatory: MBR, Number of parts, Number of points, Parts, Points Optional: M range, M array |
25 | PolygonM | Mandatory: MBR, Number of parts, Number of points, Parts, Points Optional: M range, M array |
28 | MultiPointM | Mandatory: MBR, Number of points, Points Optional: M range, M array |
31 | MultiPatch | Mandatory: MBR, Number of parts, Number of points, Parts, Part types, Points, Z range, Z array Optional: M range, M array |
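The record layout described above can be sketched in Python as a loop that reads each 8-byte record header, then the record contents, and decodes the simplest case, a Point. The function names iter_records and decode_point are hypothetical, the demo data is synthetic, and the sketch assumes (per the ESRI Shapefile Technical Description) that the record length counts 16-bit words and excludes the record header.

```python
import io
import struct

SHAPE_NULL = 0
SHAPE_POINT = 1


def iter_records(stream):
    """Yield (record_number, shape_type, body_bytes) from a .shp stream."""
    stream.seek(100)  # skip the fixed-length file header
    while True:
        rec_header = stream.read(8)
        if len(rec_header) < 8:
            break  # end of file
        # Record header: two big-endian int32 values.
        number, length_words = struct.unpack(">2i", rec_header)
        # Assumption: the length is given in 16-bit words.
        content = stream.read(length_words * 2)
        if len(content) < 4:
            raise ValueError("truncated record")
        # Record contents start with a little-endian int32 shape type.
        (shape_type,) = struct.unpack("<i", content[:4])
        yield number, shape_type, content[4:]


def decode_point(body):
    """Decode the X, Y fields of a Point (type 1) record body."""
    return struct.unpack("<2d", body[:16])


# Synthetic demo: a placeholder header followed by one Point record.
demo = (bytes(100)
        + struct.pack(">2i", 1, 10)
        + struct.pack("<i2d", SHAPE_POINT, 10.5, -3.25))
for number, shape_type, body in iter_records(io.BytesIO(demo)):
    if shape_type == SHAPE_POINT:
        print(number, decode_point(body))
```

The other shape types would need their own decoders following the field lists in the table; only Null and Point are handled here.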
Shapefile shape index format (.shx)
The index contains the same 100-byte header as the .shp file, followed by any number of 8-byte fixed-length records which consist of the following two fields:
Bytes | Type | Endianness | Usage |
0–3 | int32 | big | Record offset |
4–7 | int32 | big | Record length |
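Because every index record has the same size, the position of the i-th entry can be computed directly, which is what makes random access possible. The following Python sketch shows one way this might look; locate and read_record are illustrative names, the streams are built in memory rather than read from real files, and offsets and lengths are assumed to be in 16-bit words as stated in the ESRI Shapefile Technical Description.

```python
import io
import struct

HEADER_SIZE = 100      # .shp and .shx share the same 100-byte header
INDEX_RECORD_SIZE = 8  # each .shx record is two big-endian int32 values


def locate(shx, i):
    """Return (offset_bytes, content_length_bytes) for record i (0-based)."""
    if i < 0:
        raise IndexError("record index out of range")
    shx.seek(HEADER_SIZE + INDEX_RECORD_SIZE * i)
    raw = shx.read(INDEX_RECORD_SIZE)
    if len(raw) < INDEX_RECORD_SIZE:
        raise IndexError("record index out of range")
    offset_words, length_words = struct.unpack(">2i", raw)
    return offset_words * 2, length_words * 2  # 16-bit words to bytes


def read_record(shp, shx, i):
    """Jump straight to record i of the .shp stream using the index."""
    offset, length = locate(shx, i)
    shp.seek(offset)
    number, _ = struct.unpack(">2i", shp.read(8))  # record header
    return number, shp.read(length)                # record contents


def _point_record(number, x, y):
    """Build a synthetic Point record (header plus 20 bytes of contents)."""
    return struct.pack(">2i", number, 10) + struct.pack("<i2d", 1, x, y)


# Synthetic demo with two Point records and a matching index.
shp = io.BytesIO(bytes(HEADER_SIZE)
                 + _point_record(1, 1.0, 2.0)
                 + _point_record(2, 3.0, 4.0))
shx = io.BytesIO(bytes(HEADER_SIZE)
                 + struct.pack(">2i", 50, 10)
                 + struct.pack(">2i", 64, 10))
number, content = read_record(shp, shx, 1)
print(number, struct.unpack("<i2d", content))
```

Without the index, reaching a given record would require reading every preceding record header in the .shp file, since the records are variable-length.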
Using this index, it is possible to seek backwards in the shapefile by first seeking backwards in the shape index, then reading the record offset, and using that offset to seek to the correct position in the .shp file. It is also possible to seek forwards an arbitrary number of records using the same method.
Shapefile attribute format (.dbf)
This file stores the attributes for each shape; it uses the dBase IV format. An alternative format that can also be used is the xBase format, which has an open specification, and is used in open source shapefile libraries, such as the Shapefile C library.
The names and values of attributes are not standardized, and will be different depending on the source of the shapefile.
Shapefile spatial index format (.sbn)
This is a binary spatial index file, which is used only by Esri software. The format is not documented by Esri; however, it has been reverse-engineered and documented by the open source community. It is not currently implemented by other vendors. The .sbn file is not strictly necessary, since the .shp file contains all of the information necessary to successfully parse the spatial data.
Limitations
Topology and the shapefile format
The shapefile format does not have the ability to store topological information. The ESRI ArcInfo coverages and personal/file/enterprise geodatabases do have the ability to store feature topology.
Spatial representation
The edges of a polyline or polygon are composed of points. The spacing of the points implicitly determines the scale at which the feature is useful visually. Exceeding that scale results in jagged representation. Additional points would be required to achieve smooth shapes at greater scales. For features better represented by smooth curves, the polygon representation requires much more data storage than, for example, splines, which can capture smoothly varying shapes efficiently. None of the shapefile format types supports splines.
Data storage
The size of both the .shp and .dbf component files cannot exceed 2 GB (around 70 million point features at best). The maximum number of features for other geometry types varies depending on the number of vertices used.
The attribute database format for the .dbf component file is based on an older dBase standard. This database format inherently has a number of limitations:
- While the current dBase standard and GDAL/OGR support null values, ESRI software represents these values as zeros. This is a very serious issue for analyzing quantitative data, as null quantities represented as zero may skew representation and statistics
- Poor support for Unicode field names or field storage
- Maximum length of field names is 10 characters
- Maximum number of fields is 255
- Supported field types are: floating point, integer, date, and text
- Floating point numbers may contain rounding errors since they are stored as text
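The rounding issue in the last item can be illustrated with a small Python sketch. It mimics how a dBase numeric field holds a value as right-justified ASCII text with a fixed width and a fixed number of decimal places. The function names, the field width of 10 and the 3 decimal places are assumptions chosen for illustration, and raising an error on overflow is a choice made for this sketch, not a description of how any particular software behaves.

```python
def encode_numeric(value: float, width: int, decimals: int) -> bytes:
    """Format a number as fixed-width ASCII text, as in a numeric field."""
    text = f"{value:>{width}.{decimals}f}"  # right-justified, fixed decimals
    if len(text) > width:
        # Sketch-only behaviour: refuse values that do not fit the field.
        raise ValueError("value does not fit in the field")
    return text.encode("ascii")


def decode_numeric(raw: bytes) -> float:
    """Parse the text of a numeric field back into a float."""
    return float(raw.decode("ascii").strip())


# A value written with 3 decimal places no longer equals the original.
original = 3.14159265
stored = encode_numeric(original, width=10, decimals=3)
restored = decode_numeric(stored)
print(stored, restored, restored == original)
```

Any digits beyond the declared number of decimal places are lost when the value is written, so a round trip through the attribute table is not guaranteed to return the original number.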
Mixing shape types