Tags and columns

OpenStreetMap uses a free tagging system, and pyrosm decides which of those tags become their own GeoDataFrame columns and which land in a JSON tags column. This section explains that system and the options that let you control the columns — and the memory use — of what you read.


Pyrosm/OSM tagging system

OpenStreetMap uses a “free tagging system” that allows the map to include an unlimited number of attributes describing each feature. A tag consists of two items, a key and a value. Tags describe specific features of map elements (nodes, ways, or relations) or changesets. Both items are free-format text fields, but they often represent numeric or otherwise structured values (e.g. the maxspeed tag holds the speed limit of a road as a number) (OSM Wiki, 2020).

Because of this flexibility, OSM data tends to contain a huge number of different attributes. Keeping each of them in its own column is not practical (the dataframe could end up with hundreds of columns), so Pyrosm implements its own tagging system in which only specific tags are kept as columns (defined separately for each OSM key). All other attributes are stored in a separate column "tags", whose values are valid JSON objects.
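The split described above can be pictured with a minimal sketch in plain Python. This is not pyrosm's actual implementation: the function name split_tags, the sample element and the shortened list of column keys are all assumptions made for illustration.

```python
import json

# Hypothetical, shortened list of keys that get their own column
COLUMN_KEYS = ["name", "building", "building:levels"]


def split_tags(element_tags, column_keys):
    """Split one element's tags into column values and a JSON remainder."""
    # Keys on the list become columns; absent ones are filled with None
    columns = {key: element_tags.get(key) for key in column_keys}
    # Everything else is collected into a single JSON string
    rest = {k: v for k, v in element_tags.items() if k not in column_keys}
    columns["tags"] = json.dumps(rest, separators=(",", ":")) if rest else None
    return columns


# Made-up element for illustration
element = {"building": "residential", "name": "Example", "mml:class": "42211"}
row = split_tags(element, COLUMN_KEYS)
print(row)
```

In this sketch an element without any leftover keys gets None in "tags", which mirrors the None values seen in the real output later in this section.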

It is possible to see these default tags from the osm instance directly by accessing its configuration settings. Let’s see how:

from pyrosm import OSM, get_data

# Initialize the OSM reader with test data
fp = get_data("test_pbf")
osm = OSM(fp)

# The instance has a configuration attribute containing:
print([item for item in osm.conf.__dict__.keys() if not item.startswith("_")])
['network_filters', 'tags']

The configuration has two attributes, network_filters and tags:

  • network_filters contains the rules that are applied when parsing different kinds of road networks from the OSM data

  • tags contains the tags that are parsed into columns by default

Let’s take a closer look into the tags:

# Show all available tag attributes
osm.conf.tags.available
['aerialway',
 'aeroway',
 'amenity',
 'boundary',
 'building',
 'craft',
 'emergency',
 'geological',
 'highway',
 'historic',
 'landuse',
 'leisure',
 'natural',
 'office',
 'power',
 'public_transport',
 'railway',
 'route',
 'place',
 'shop',
 'tourism',
 'waterway']

This list contains the OSM primary features that can be parsed from the data (see wiki for details). Each item holds a list of default tags (OSM keys) that are turned into columns when parsing the OSM data with Pyrosm.

For example, the default tags that are turned into columns for buildings can be accessed with:

# Show all tags that are converted into columns from building features
osm.conf.tags.building
['addr:city',
 'addr:country',
 'addr:full',
 'addr:housenumber',
 'addr:housename',
 'addr:postcode',
 'addr:place',
 'addr:street',
 'email',
 'name',
 'opening_hours',
 'operator',
 'phone',
 'ref',
 'url',
 'website',
 'yes',
 'building',
 'amenity',
 'building:flats',
 'building:levels',
 'building:material',
 'building:max_level',
 'building:min_level',
 'building:fireproof',
 'building:use',
 'craft',
 'height',
 'internet_access',
 'landuse',
 'levels',
 'office',
 'operator',
 'shop',
 'source',
 'start_date',
 'wikipedia']

As we can see, quite a few attributes are parsed into columns if they exist in the data. The list is mostly based on the OSM documentation for Key:building, but it also contains generic attributes that are useful for many types of OSM features, such as name, address information, opening_hours and website. The same approach is used for all the OSM keys listed above in conf.tags.available. Attributes in the data that are not on the default list are stored separately in the "tags" column.

Let’s make an example to understand this better:

# Parse buildings
buildings = osm.get_buildings()

# Print columns
buildings.columns
Index(['addr:city', 'addr:country', 'addr:housenumber', 'addr:postcode',
       'addr:street', 'name', 'opening_hours', 'phone', 'building',
       'building:levels', 'landuse', 'shop', 'source', 'id', 'timestamp',
       'version', 'tags', 'geometry', 'osm_type'],
      dtype='object')
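To see which default tags actually produced a column, the two listings above can be compared. The sketch below copies both lists by hand from the printed output so that it runs without the data file; with a live reader, osm.conf.tags.building and list(buildings.columns) would presumably take their place.

```python
# Default building tags, copied from the osm.conf.tags.building output above
default_tags = [
    "addr:city", "addr:country", "addr:full", "addr:housenumber",
    "addr:housename", "addr:postcode", "addr:place", "addr:street",
    "email", "name", "opening_hours", "operator", "phone", "ref", "url",
    "website", "yes", "building", "amenity", "building:flats",
    "building:levels", "building:material", "building:max_level",
    "building:min_level", "building:fireproof", "building:use", "craft",
    "height", "internet_access", "landuse", "levels", "office", "operator",
    "shop", "source", "start_date", "wikipedia",
]

# Columns of the parsed buildings, copied from the buildings.columns output above
columns = [
    "addr:city", "addr:country", "addr:housenumber", "addr:postcode",
    "addr:street", "name", "opening_hours", "phone", "building",
    "building:levels", "landuse", "shop", "source", "id", "timestamp",
    "version", "tags", "geometry", "osm_type",
]

# Default tags that became columns
present = [tag for tag in default_tags if tag in columns]

# Default tags with no column, de-duplicated with order preserved
missing = list(dict.fromkeys(tag for tag in default_tags if tag not in columns))

# Columns that do not come from the default tag list
other = [col for col in columns if col not in default_tags]

print(len(present), "default tags became columns")
print("Default tags without a column:", missing)
print("Columns that are not default tags:", other)
```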

Our test data contains many of the default tags as columns, though not all of them. The “tags” column also seems to hold some additional data that was not on the default tag list. Let’s take a closer look at it:

# List "extra" tags that were associated with some of the buildings
buildings["tags"].unique()
array([None, '{"mml:class":"42211"}', '{"mml:class":"42221"}',
       '{"mml:class":"42261"}', '{"mml:class":"42241"}',
       '{"mml:class":"42212"}'], dtype=object)

As we can see, some of the OSM elements carry a "mml:class" tag. This additional data may be relevant for some users but probably not for most, so it is not added as a column to the GeoDataFrame.

It is still possible to access the data values of these “extra tags” by parsing the data from the JSON e.g. as follows:

import json

# Keep only the rows that have extra tags
rows_with_extra_info = buildings.dropna(subset=["tags"])

# Iterate over the first 10 such rows and print out the values
for row in rows_with_extra_info.head(10).itertuples():

    # Read the JSON
    tags = json.loads(row.tags)

    # Print the keys and values
    for key, value in tags.items():
        print("Key:", key, ", value: ", value)
Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42221
Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42211
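Because the "tags" column holds None for elements without extra tags, a small helper that tolerates missing values can be convenient. The following is a sketch only: parse_tags is a hypothetical helper, not part of pyrosm, and the sample strings follow the format of the "tags" values printed earlier.

```python
import json
from collections import Counter


def parse_tags(raw):
    """Return the extra tags as a dict; an empty dict when there are none."""
    return json.loads(raw) if raw else {}


# A few sample values in the same format as the "tags" column
samples = [
    None,
    '{"mml:class":"42211"}',
    '{"mml:class":"42221"}',
    '{"mml:class":"42211"}',
]

# Count how often each "mml:class" value occurs among the samples
counts = Counter()
for raw in samples:
    value = parse_tags(raw).get("mml:class")
    if value is not None:
        counts[value] += 1

print(counts)
```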

Controlling which OSM attributes are parsed into columns

In some cases, it is useful to parse some of these “extra” attributes directly into columns. This is easy to do with pyrosm, as demonstrated below.

from pyrosm import OSM, get_data
# Get test data 
fp = get_data("test_pbf")

# Initialize the reader
osm = OSM(fp)
            
buildings = osm.get_buildings()

# Print info
print("Existing columns:\n", buildings.columns)
print("\nAdditional attributes in the 'tags': \n", buildings.tags.unique())
Existing columns:
 Index(['addr:city', 'addr:country', 'addr:housenumber', 'addr:postcode',
       'addr:street', 'name', 'opening_hours', 'phone', 'building',
       'building:levels', 'landuse', 'shop', 'source', 'id', 'timestamp',
       'version', 'tags', 'osm_type', 'geometry'],
      dtype='object')

Additional attributes in the 'tags': 
 [None '{"mml:class":"42211"}' '{"mml:class":"42221"}'
 '{"mml:class":"42261"}' '{"mml:class":"42241"}' '{"mml:class":"42212"}']

The "tags" column includes additional information under the key "mml:class". To parse this attribute into its own column in the resulting GeoDataFrame, use the extra_attributes parameter, which accepts a list of one or more keys that will be converted into columns:

# Parse buildings and store also "mml:class" as a column
buildings2 = osm.get_buildings(extra_attributes=["mml:class"])

# Print columns
buildings2.columns
Index(['addr:city', 'addr:country', 'addr:housenumber', 'addr:postcode',
       'addr:street', 'name', 'opening_hours', 'phone', 'building',
       'building:levels', 'landuse', 'shop', 'source', 'id', 'timestamp',
       'version', 'mml:class', 'osm_type', 'geometry'],
      dtype='object')

Great! Now "mml:class" was also added as a column in our GeoDataFrame:

buildings2.tail(5)
addr:city addr:country addr:housenumber addr:postcode addr:street name opening_hours phone building building:levels landuse shop source id timestamp version mml:class osm_type geometry
2188 None None None None None None None None residential None None None None 424115702 1465573852 1 42211 way POLYGON ((26.96337 60.52196, 26.96330 60.52205...
2189 None None None None None None None None residential None None None None 424115707 1465573852 1 42211 way POLYGON ((26.96773 60.53151, 26.96771 60.53167...
2190 None None None None None None None None residential None None None None 424115720 1465573853 1 42211 way POLYGON ((26.95398 60.52896, 26.95416 60.52883...
2191 None None None None None None None None residential None None None None 424115722 1465573853 1 42211 way POLYGON ((26.96623 60.53462, 26.96615 60.53469...
2192 None None None None None None None None residential None None None None 424115743 1465573855 1 42211 way POLYGON ((26.93940 60.52654, 26.93940 60.52662...

Now it is easy to access and use the values of the new column in a similar manner as any other column:

# Get unique values in the "mml:class" column
print(buildings2["mml:class"].unique())
[None '42211' '42221' '42261' '42241' '42212']
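The values in the new column are strings, as the output above shows. If numeric class codes are needed, they can be converted while leaving missing values untouched. A minimal sketch on a plain list copied from that output; to_int is a hypothetical helper, not a pyrosm function:

```python
def to_int(value):
    """Convert a tag value to int, keeping None for missing values."""
    return None if value is None else int(value)


# Unique "mml:class" values copied from the output above
values = [None, "42211", "42221", "42261", "42241", "42212"]

codes = [to_int(v) for v in values]
print(codes)
```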

Keep only the tags you need

By default pyrosm parses a feature’s default set of tags into columns. If you only need a few, pass tags_to_keep to a get_* method to keep just those as columns (everything else still goes to the JSON tags column). This makes the output narrower and uses less memory.

from pyrosm import OSM, get_data

osm = OSM(get_data("test_pbf"))
buildings = osm.get_buildings(tags_to_keep=["building", "addr:city"])
buildings.columns.tolist()
['building',
 'addr:city',
 'id',
 'timestamp',
 'version',
 'tags',
 'osm_type',
 'geometry']

Drop element metadata to save memory

Every OSM element carries metadata — timestamp, version and changeset. If you don’t need it, set keep_metadata=False on the OSM reader: the metadata columns are dropped and the per-node metadata is not even decoded while parsing, which lowers memory use and parse time on node-heavy files. The default (keep_metadata=True) is unchanged. History (.osh.pbf) files always keep the metadata they require.

# Default: the metadata columns are present
present = OSM(get_data("test_pbf")).get_buildings()
[c for c in ["timestamp", "version", "changeset"] if c in present.columns]
['timestamp', 'version']

# keep_metadata=False drops them
slim = OSM(get_data("test_pbf"), keep_metadata=False).get_buildings()
[c for c in ["timestamp", "version", "changeset"] if c in slim.columns]
[]