Tags and columns

OpenStreetMap uses a free tagging system, and pyrosm decides which of those tags become their own GeoDataFrame columns and which land in a JSON tags column. This section explains that system and the options that let you control the columns — and the memory use — of what you read.

How to?

The Pyrosm/OSM tagging system
Control which attributes become columns
Keep only the tags you need
Drop element metadata to save memory

Pyrosm/OSM tagging system#

OpenStreetMap uses a “free tagging system” that allows the map to include an unlimited number of attributes describing each feature. A tag consists of two items, a key and a value. Tags describe specific features of map elements (nodes, ways, or relations) or changesets. Both items are free-format text fields, but they often represent numeric or other structured values (e.g. the maxspeed tag stores the speed limit of a road as a number) (OSM Wiki, 2020).

Because of this flexibility, OSM data tends to contain a huge number of different attributes. Keeping all of them in their own columns is not practical (the dataframe could end up with hundreds of columns), so Pyrosm implements its own tagging system in which only specific tags are kept as columns (defined separately for each OSM key). All remaining attributes are stored in a separate column, "tags", whose values are valid JSON objects.
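The split described above can be illustrated with a minimal sketch in plain Python. The helper split_tags and the sample tags are hypothetical and only mirror the idea; this is not Pyrosm's actual implementation.

```python
import json


def split_tags(tags, column_keys):
    """Split one element's tags into column values and a JSON string of the rest.

    Hypothetical helper for illustration only; not Pyrosm's actual code.
    """
    # Every requested key becomes a column; missing keys are filled with None
    columns = {key: tags.get(key) for key in column_keys}
    # Everything else is collected for the "tags" column
    extra = {key: value for key, value in tags.items() if key not in column_keys}
    # Elements without leftover tags get None instead of an empty JSON object
    extra_json = json.dumps(extra, separators=(",", ":")) if extra else None
    return columns, extra_json


# Made-up tags for a single building
element_tags = {"building": "residential", "name": "Example House", "mml:class": "42211"}
columns, extra_json = split_tags(element_tags, column_keys=["building", "name", "height"])
print(columns)
print(extra_json)
```

Here "building" and "name" end up as column values, "height" is filled with None because the element does not have it, and "mml:class" is left over in the JSON string.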

It is possible to see these default tags from the osm instance directly by accessing its configuration settings. Let’s see how:

from pyrosm import OSM, get_data

# Initialize the OSM reader with test data
fp = get_data("test_pbf")
osm = OSM(fp)

# The instance has a configuration attribute containing:
print([item for item in osm.conf.__dict__.keys() if not item.startswith("_")])

['network_filters', 'tags']

From here we can see that the configuration includes a network_filters attribute and a tags attribute:

The network_filters attribute contains the rules that are applied when parsing different kinds of roads from OSM
The tags attribute contains the tags that are parsed into columns by default

Let’s take a closer look into the tags:

# Show all available tag attributes
osm.conf.tags.available

['aerialway',
 'aeroway',
 'amenity',
 'boundary',
 'building',
 'craft',
 'emergency',
 'geological',
 'highway',
 'historic',
 'landuse',
 'leisure',
 'natural',
 'office',
 'power',
 'public_transport',
 'railway',
 'route',
 'place',
 'shop',
 'tourism',
 'waterway']

This list contains essentially all OSM primary features that can be parsed from OSM (see the wiki for details). Each of these items contains a list of default tags (OSM keys) that are turned into columns when parsing OSM data with Pyrosm.

For example, the default tags that are turned into columns for buildings can be accessed with:

# Show all tags that are converted into columns from building features
osm.conf.tags.building

['addr:city',
 'addr:country',
 'addr:full',
 'addr:housenumber',
 'addr:housename',
 'addr:postcode',
 'addr:place',
 'addr:street',
 'email',
 'name',
 'opening_hours',
 'operator',
 'phone',
 'ref',
 'url',
 'website',
 'yes',
 'building',
 'amenity',
 'building:flats',
 'building:levels',
 'building:material',
 'building:max_level',
 'building:min_level',
 'building:fireproof',
 'building:use',
 'craft',
 'height',
 'internet_access',
 'landuse',
 'levels',
 'office',
 'operator',
 'shop',
 'source',
 'start_date',
 'wikipedia']

As we can see, quite a few attributes are parsed into columns if they exist in the data. The list is mostly based on the OSM documentation for Key:building, but it also contains some generic attributes that are useful for many types of OSM features, such as name, address information, opening_hours and website. A similar approach is used for all OSM keys listed above in conf.tags.available. If the data contains attributes that are not in the default tags, they are stored separately in the "tags" column.

Let’s look at an example to understand this better:

# Parse buildings
buildings = osm.get_buildings()

# Print columns
buildings.columns

Index(['addr:city', 'addr:country', 'addr:housenumber', 'addr:postcode',
       'addr:street', 'name', 'opening_hours', 'phone', 'building',
       'building:levels', 'landuse', 'shop', 'source', 'id', 'timestamp',
       'version', 'tags', 'geometry', 'osm_type'],
      dtype='object')
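A quick way to see which default tags did not end up as columns is a set difference. The sketch below uses only the standard library, with both lists copied from the outputs above, so it runs without pyrosm. In a live session, set(osm.conf.tags.building) - set(buildings.columns) should give the same answer.

```python
# Default building tags, copied from the osm.conf.tags.building output above
default_tags = [
    "addr:city", "addr:country", "addr:full", "addr:housenumber",
    "addr:housename", "addr:postcode", "addr:place", "addr:street",
    "email", "name", "opening_hours", "operator", "phone", "ref", "url",
    "website", "yes", "building", "amenity", "building:flats",
    "building:levels", "building:material", "building:max_level",
    "building:min_level", "building:fireproof", "building:use", "craft",
    "height", "internet_access", "landuse", "levels", "office", "operator",
    "shop", "source", "start_date", "wikipedia",
]

# Column names, copied from the buildings.columns output above
columns = [
    "addr:city", "addr:country", "addr:housenumber", "addr:postcode",
    "addr:street", "name", "opening_hours", "phone", "building",
    "building:levels", "landuse", "shop", "source", "id", "timestamp",
    "version", "tags", "geometry", "osm_type",
]

# Default tags that do not occur in the test data, so no column was created
missing = sorted(set(default_tags) - set(columns))

# Columns that are not default tags (identifiers, metadata, extra tags, geometry)
non_tag_columns = [name for name in columns if name not in default_tags]

print(len(missing), "default tags have no column, for example:", missing[:3])
print("Columns that are not default tags:", non_tag_columns)
```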

Our test data contains many of the default tags as columns (though not all). There also seems to be some additional data in the “tags” column that was not listed in the default tag list.

Let’s take a closer look at those:

# List "extra" tags that were associated with some of the buildings
buildings["tags"].unique()

array([None, '{"mml:class":"42211"}', '{"mml:class":"42221"}',
       '{"mml:class":"42261"}', '{"mml:class":"42241"}',
       '{"mml:class":"42212"}'], dtype=object)

As we can see, some of the OSM elements include a "mml:class" tag. This extra information might be relevant for some users but probably not for most, so it is not added as a column to the GeoDataFrame.

It is still possible to access the values of these “extra tags” by parsing the JSON, e.g. as follows:

import json

# Keep only the rows that have extra tags
rows_with_extra_info = buildings.dropna(subset=["tags"])

# Iterate over the first 10 of those rows and print out the values
for row in rows_with_extra_info.head(10).itertuples():

    # Read the JSON
    tags = json.loads(row.tags)

    # Print the keys and values
    for key, value in tags.items():
        print("Key:", key, ", value: ", value)

Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42221
Key: mml:class , value:  42211
Key: mml:class , value:  42211
Key: mml:class , value:  42211
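When only one key is of interest, the same JSON parsing can be wrapped in a small helper. The sketch below is an illustration using the standard library; extract_tag is a hypothetical name, and the sample strings only imitate the shape of the "tags" values shown above.

```python
import json
from collections import Counter


def extract_tag(raw, key):
    """Return the value stored under `key` in a JSON tags string, or None."""
    # Rows without extra tags hold a missing value instead of a JSON string
    if not isinstance(raw, str):
        return None
    return json.loads(raw).get(key)


# Sample values in the shape of the "tags" column (made up for this sketch)
sample_tags = [
    None,
    '{"mml:class":"42211"}',
    '{"mml:class":"42221"}',
    '{"mml:class":"42211"}',
]

values = [extract_tag(raw, "mml:class") for raw in sample_tags]
counts = Counter(value for value in values if value is not None)

print(values)
print(counts)
```

With a real GeoDataFrame, something like buildings["tags"].apply(lambda raw: extract_tag(raw, "mml:class")) should produce the same values as a pandas Series.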

Controlling which OSM attributes are parsed into columns#

In some cases, it might be useful to parse some of these “extra” attributes directly into columns. This is easy to do with pyrosm, as demonstrated below.

from pyrosm import OSM, get_data
# Get test data 
fp = get_data("test_pbf")

# Initialize the reader
osm = OSM(fp)
            
buildings = osm.get_buildings()

# Print info
print("Existing columns:\n", buildings.columns)
print("\nAdditional attributes in the 'tags': \n", buildings.tags.unique())

Existing columns:
 Index(['addr:city', 'addr:country', 'addr:housenumber', 'addr:postcode',
       'addr:street', 'name', 'opening_hours', 'phone', 'building',
       'building:levels', 'landuse', 'shop', 'source', 'id', 'timestamp',
       'version', 'tags', 'osm_type', 'geometry'],
      dtype='object')

Additional attributes in the 'tags': 
 [None '{"mml:class":"42211"}' '{"mml:class":"42221"}'
 '{"mml:class":"42261"}' '{"mml:class":"42241"}' '{"mml:class":"42212"}']

The "tags" column includes additional information with the key "mml:class". If we would like this attribute to be a column in the resulting GeoDataFrame as well, we can use the extra_attributes parameter, which accepts a list of one or more keys that will be converted into columns:

# Parse buildings and store also "mml:class" as a column
buildings2 = osm.get_buildings(extra_attributes=["mml:class"])

# Print columns
buildings2.columns

Index(['addr:city', 'addr:country', 'addr:housenumber', 'addr:postcode',
       'addr:street', 'name', 'opening_hours', 'phone', 'building',
       'building:levels', 'landuse', 'shop', 'source', 'id', 'timestamp',
       'version', 'mml:class', 'osm_type', 'geometry'],
      dtype='object')

Great! Now "mml:class" was also added as a column in our GeoDataFrame:

buildings2.tail(5)

	addr:city	addr:country	addr:housenumber	addr:postcode	addr:street	name	opening_hours	phone	building	building:levels	landuse	shop	source	id	timestamp	version	mml:class	osm_type	geometry
2188	None	None	None	None	None	None	None	None	residential	None	None	None	None	424115702	1465573852	1	42211	way	POLYGON ((26.96337 60.52196, 26.96330 60.52205...
2189	None	None	None	None	None	None	None	None	residential	None	None	None	None	424115707	1465573852	1	42211	way	POLYGON ((26.96773 60.53151, 26.96771 60.53167...
2190	None	None	None	None	None	None	None	None	residential	None	None	None	None	424115720	1465573853	1	42211	way	POLYGON ((26.95398 60.52896, 26.95416 60.52883...
2191	None	None	None	None	None	None	None	None	residential	None	None	None	None	424115722	1465573853	1	42211	way	POLYGON ((26.96623 60.53462, 26.96615 60.53469...
2192	None	None	None	None	None	None	None	None	residential	None	None	None	None	424115743	1465573855	1	42211	way	POLYGON ((26.93940 60.52654, 26.93940 60.52662...

Now it is easy to access and use the values of the new column in the same way as any other column:

# Get unique values in the "mml:class" column
print(buildings2["mml:class"].unique())

[None '42211' '42221' '42261' '42241' '42212']

Keep only the tags you need#

By default pyrosm parses a feature’s default set of tags into columns. If you only need a few, pass tags_to_keep to a get_* method to keep just those as columns (everything else still goes to the JSON tags column). This makes the output narrower and uses less memory.

from pyrosm import OSM, get_data

osm = OSM(get_data("test_pbf"))
buildings = osm.get_buildings(tags_to_keep=["building", "addr:city"])
buildings.columns.tolist()

['building',
 'addr:city',
 'id',
 'timestamp',
 'version',
 'tags',
 'osm_type',
 'geometry']

Drop element metadata to save memory#

Every OSM element carries metadata — timestamp, version and changeset. If you don’t need it, set keep_metadata=False on the OSM reader: the metadata columns are dropped and the per-node metadata is not even decoded while parsing, which lowers memory use and parse time on node-heavy files. The default (keep_metadata=True) is unchanged. History (.osh.pbf) files always keep the metadata they require.

# Default: the metadata columns are present
present = OSM(get_data("test_pbf")).get_buildings()
[c for c in ["timestamp", "version", "changeset"] if c in present.columns]

['timestamp', 'version']

# keep_metadata=False drops them
slim = OSM(get_data("test_pbf"), keep_metadata=False).get_buildings()
[c for c in ["timestamp", "version", "changeset"] if c in slim.columns]

[]