Benchmarks

Pyrosm aims to be an easy-to-use and fast Python tool for parsing OpenStreetMap data from Protocolbuffer Binary Format (PBF) files into geopandas, Python’s go-to library for working with spatial data. Pyrosm is written mainly in Cython (Python with C-like performance), which probably makes it faster than other Python alternatives for parsing OpenStreetMap data. Pyrosm is built on top of another Cython library called Pyrobuf, a faster Cython alternative to Google’s Protobuf library: it provides a 2-4x boost in performance for deserializing protocol buffer messages compared to Google’s version with the C++ backend. Google’s Protocol Buffers is a commonly used and efficient method for serializing and compressing structured data, and it is also used by OpenStreetMap contributors to distribute OSM data in PBF format. In addition, Pyrosm makes extensive use of NumPy array operations and parses geometries using Pygeos, both of which are excellent choices for fast and memory-efficient data manipulation in Python.

To better understand the performance of Pyrosm, it is compared here against other similar tools. There are various tools available for parsing OSM data (such as Osmosis, Osmium-tool, esy-osm-pbf, osmread); however, most of them are not easy to use, are not written for Python, or are hard to install on all operating systems. The most similar tool to Pyrosm is OSMnx, which makes it possible to retrieve OpenStreetMap data easily into GeoDataFrames using the Overpass API (it also inspired building this one!). Hence, the comparisons and benchmarking are done between Pyrosm and OSMnx. In the tests, we benchmark the time it takes to retrieve and parse different datasets for the Helsinki Region, Finland. We also test how long it takes to parse data covering large geographical areas such as the state of New York.

Hardware: The benchmarks are conducted on a Lenovo T480 laptop with 16 GB of RAM, an SSD and an Intel Core i5-8250U CPU (1.6 GHz) running Windows 10.

import osmnx as ox
from shapely.geometry import box
from pyrosm import OSM, get_data
import time

# Parse buildings with Pyrosm and time it
start_t = time.time()

# A PBF data for Helsinki Region (~34 MB)
# will be downloaded automatically to TEMP with Pyrosm
pbf_path = get_data("helsinki_region_pbf")

osm = OSM(pbf_path)
buildings_A = osm.get_buildings()
print("=====================================\nPYROSM\n=====================================")
print(f"Parsing buildings with Pyrosm lasted {round(time.time() - start_t, 1)} seconds.")
print(f"Number of buildings in the Pyrosm dataset: {len(buildings_A)}")
print("\n=====================================\n")

# Parse buildings from the same region using OSMnx and time it
start_t = time.time()
data_extent = box(*buildings_A.total_bounds)
buildings_B = ox.footprints_from_polygon(polygon=data_extent,
                                         footprint_type="building",
                                         retain_invalid=False,
                                         )
print("=====================================\nOSMNX\n=====================================")
print(f"Parsing buildings with OSMnx lasted {round(time.time() - start_t, 1)} seconds.")
print(f"Number of buildings in the OSMnx dataset: {len(buildings_B)}")
print("\n=====================================\n")
Downloaded Protobuf data 'Helsinki_region.osm.pbf' (35.0 MB) to TEMP:
'C:\Users\LOCALA~1\AppData\Local\Temp\pyrosm\Helsinki_region.osm.pbf'
=====================================
PYROSM
=====================================
Parsing buildings with Pyrosm lasted 22.9 seconds.
Number of buildings in the Pyrosm dataset: 175970

=====================================

=====================================
OSMNX
=====================================
Parsing buildings with OSMnx lasted 60.9 seconds.
Number of buildings in the OSMnx dataset: 180893

=====================================

Okay, as we can see, Pyrosm is approximately 2.7 times faster than OSMnx (22.9 vs. 60.9 seconds) at parsing buildings from the given area.
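The timing pattern used in the script above (calling `time.time()` before and after each step) can be wrapped in a small reusable helper. The following is only a sketch and not part of the original benchmark: the `timed` helper is a hypothetical name, and it uses `time.perf_counter()`, a monotonic clock intended for measuring intervals, instead of `time.time()`. The workload is a stand-in that you would replace with a call such as `osm.get_buildings()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block and print the elapsed wall-clock seconds.

    `timed` is a hypothetical helper, not part of Pyrosm or OSMnx.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            # Keep the raw number so ratios can be computed later
            results[label] = elapsed
        print(f"{label} lasted {elapsed:.1f} seconds.")


# Stand-in workload; replace with e.g. osm.get_buildings()
timings = {}
with timed("Summing numbers", timings):
    total = sum(range(1_000_000))
```

Storing the elapsed times in a dictionary makes it easy to compare tools afterwards without copying numbers from printed output.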

That is not bad, but the difference between the two becomes even more evident when parsing multiple datasets from OSM (e.g. roads, buildings, and Points of Interest). Whereas OSMnx makes a separate Overpass API call for each of these datasets, Pyrosm needs to download the raw data only once and then parses the different datasets from the same data dump.

Let’s conduct another comparison between the tools by reading buildings, Points of Interest (amenities) and roads from OpenStreetMap:

import osmnx as ox
from shapely.geometry import box
from pyrosm import OSM, get_data
import time

# =======================
# PYROSM
# =======================

# Total time
tot_t = time.time()

# A PBF data for Helsinki Region (~34 MB)
# Will be downloaded automatically to TEMP with Pyrosm
pbf_path = get_data("helsinki_region_pbf")
osm = OSM(pbf_path)

start_t = time.time()
buildings_A = osm.get_buildings()
print("=====================================\nPYROSM\n=====================================")
print(f"Parsing buildings with Pyrosm lasted {round(time.time() - start_t, 1)} seconds.")
print(f"Number of buildings in the Pyrosm dataset: {len(buildings_A)}")
print("\n......................................\n")

start_t = time.time()
roads_A = osm.get_network("driving")
print(f"Parsing roads with Pyrosm lasted {round(time.time() - start_t, 1)} seconds.")
print(f"Number of roads in the Pyrosm dataset: {len(roads_A)}")
print("\n......................................\n")

start_t = time.time()
pois_A = osm.get_pois({"amenity": True})
print(f"Parsing POIs with Pyrosm lasted {round(time.time() - start_t, 1)} seconds.")
print(f"Number of POIs in the Pyrosm dataset: {len(pois_A)}")
print("\n......................................\n")
print(f"TOTAL TIME: {round(time.time() - tot_t, 1)} seconds." )
print("\n======================================\n")

# =======================
# OSMNX
# =======================
data_extent = box(*buildings_A.total_bounds)

# Total time
tot_t = time.time()

# Parse data from the same region using OSMnx and time it
# (data_extent was already computed above, outside the timed section)
start_t = time.time()

buildings_B = ox.footprints_from_polygon(polygon=data_extent,
                                         footprint_type="building",
                                         retain_invalid=False,
                                         )
print("=====================================\nOSMNX\n=====================================")
print(f"Parsing buildings with OSMnx lasted {round(time.time() - start_t, 1)} seconds.")
print(f"Number of buildings in the OSMnx dataset: {len(buildings_B)}")
print("\n......................................\n")

start_t = time.time()

# Parsing street networks with OSMnx requires first building the graph
# and then converting it to a GeoDataFrame (as far as we know, there is
# no way to get the GeoDataFrame directly)
roads_B_graph = ox.graph_from_polygon(polygon=data_extent, network_type="drive")
roads_B = ox.graph_to_gdfs(roads_B_graph, nodes=False)

print(f"Parsing roads with OSMnx lasted {round(time.time() - start_t, 1)} seconds.")
print(f"Number of roads in the OSMnx dataset: {len(roads_B)}")
print("\n......................................\n")

start_t = time.time()
pois_B = ox.pois_from_polygon(polygon=data_extent)

print(f"Parsing POIs with OSMnx lasted {round(time.time() - start_t, 1)} seconds.")
print(f"Number of POIs in the OSMnx dataset: {len(pois_B)}")
print("\n......................................\n")
print(f"TOTAL TIME: {round(time.time() - tot_t, 1)} seconds." )
print("\n======================================\n")
=====================================
PYROSM
=====================================
Parsing buildings with Pyrosm lasted 18.4 seconds.
Number of buildings in the Pyrosm dataset: 175970

......................................

Parsing roads with Pyrosm lasted 3.2 seconds.
Number of roads in the Pyrosm dataset: 85397

......................................

Parsing POIs with Pyrosm lasted 4.3 seconds.
Number of roads in the Pyrosm dataset: 26102

......................................

TOTAL TIME: 26.7 seconds.

======================================

=====================================
OSMNX
=====================================
Parsing buildings with OSMnx lasted 58.0 seconds.
Number of buildings in the OSMnx dataset: 180893

......................................

Parsing roads with OSMnx lasted 78.3 seconds.
Number of roads in the OSMnx dataset: 60406

......................................

Parsing POIs with OSMnx lasted 52.9 seconds.
Number of POIs in the OSMnx dataset: 30821

......................................

TOTAL TIME: 189.2 seconds.

======================================

Okay, as we can see from these results, Pyrosm is now approximately 7x faster altogether. However, the difference is even larger when looking at timings of the last two datasets:

  • Parsing roads with Pyrosm took 3.2 seconds compared to 78.3 seconds with OSMnx (Pyrosm was ~24x faster)

  • Parsing POIs with Pyrosm took 4.3 seconds compared to 52.9 seconds with OSMnx (Pyrosm was ~12x faster)
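The ratios quoted above follow directly from the reported timings. As a quick check, the sketch below recomputes them from the numbers printed in the benchmark output; the dictionary and variable names are only illustrative.

```python
# Timings in seconds, copied from the benchmark output above
pyrosm_s = {"buildings": 18.4, "roads": 3.2, "POIs": 4.3, "total": 26.7}
osmnx_s = {"buildings": 58.0, "roads": 78.3, "POIs": 52.9, "total": 189.2}

# Speed-up = OSMnx time divided by Pyrosm time for the same dataset
speedups = {name: osmnx_s[name] / pyrosm_s[name] for name in pyrosm_s}

for name, ratio in speedups.items():
    print(
        f"{name:<10} Pyrosm {pyrosm_s[name]:>6.1f} s | "
        f"OSMnx {osmnx_s[name]:>6.1f} s | {ratio:.1f}x"
    )
```

Note that the buildings ratio in this run differs from the first benchmark because the two runs recorded different timings (22.9 vs. 18.4 seconds for Pyrosm, 60.9 vs. 58.0 seconds for OSMnx).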

These differences are partially due to the difference in the design of the tools. Pyrosm downloads the data only once and parses all basic OSM elements from the PBF during the first call. After the first call, all other calls (with the same initialized OSM instance) are read and parsed directly from memory, which is very fast. OSMnx also supports caching, meaning that if you make identical calls, OSMnx does not necessarily fetch the data from the API again. However, if you make a slight change to the call, OSMnx needs to make a new call to the Overpass API, whereas Pyrosm keeps using the same raw data dump once initialized.
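The "parse once, query many times" design described above can be illustrated with a toy class. This is a hypothetical sketch of the general pattern only; the class, its attributes and the sample records are invented for illustration and do not reflect Pyrosm's actual internals.

```python
class ParsedOnceSource:
    """Toy stand-in for a reader that parses raw data once and reuses it.

    Hypothetical illustration only; this is not Pyrosm's actual code.
    """

    def __init__(self, raw_records):
        self._raw_records = raw_records
        self._parsed = None   # filled lazily on the first query
        self.parse_count = 0  # how many times the expensive step ran

    def _ensure_parsed(self):
        if self._parsed is None:
            # Expensive step: runs only on the first query
            self._parsed = [dict(record) for record in self._raw_records]
            self.parse_count += 1
        return self._parsed

    def get(self, key):
        """Return all records that carry the given tag key."""
        return [r for r in self._ensure_parsed() if key in r]


raw = [{"building": "yes"}, {"highway": "residential"}, {"amenity": "cafe"}]
source = ParsedOnceSource(raw)

# Three different queries, but the raw data is parsed only once
buildings = source.get("building")
roads = source.get("highway")
pois = source.get("amenity")
print(f"Expensive parsing step ran {source.parse_count} time(s).")
```

Because the parsed elements are kept on the instance, each additional query only filters data already in memory, which mirrors why the second and third Pyrosm calls in the benchmark were much cheaper than the first.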

Caveats of Pyrosm compared to OSMnx

Although Pyrosm is fast and as easy to use as OSMnx, there is currently a clear usability difference between the two when it comes to retrieving the raw OSM data. Whereas OSMnx works all over the world out of the box, Pyrosm currently requires you to download the data separately from Geofabrik (and possibly crop it) before you can get the speed benefits of the tool. Pyrosm has a few test datasets available (see pyrosm.data.available) that can be used easily, and in the future the hope is to add support for clipping the PBF data by a bounding box and saving the extract into a new PBF (similar to what Osmosis/Osmium-tool do now). After this, it would be possible to add automatic download of PBF data in a similar manner as OSMnx does currently.

Parsing large datasets

While obtaining relatively small OSM datasets is easy using e.g. OSMnx, Pyrosm starts to shine when you need to obtain data from large geographical areas such as countries or states, or when parsing OSM data quickly from local disk is important (the original reason why this library was developed).

As an example of such a case, next we will measure how long it takes to parse all roads and buildings from the state of New York in the USA. The data (~210 MB) can be downloaded from Geofabrik. The test is done using a laptop with 16 GB of memory, an SSD, and an Intel Core i5-8250U CPU (1.6 GHz).

from pyrosm import OSM, get_data
import time

# Initialize (downloads data automatically for New York State)
fp = get_data("new_york")
osm = OSM(fp)

# Parse roads and time it
start_time = time.time()
roads = osm.get_network("driving")
print(f"Parsing roads lasted {round(time.time() - start_time, 0)} seconds.")
print(f"Number of roads parsed: {len(roads)}")
Downloaded Protobuf data 'new-york-latest.osm.pbf' (208.1 MB) to TEMP:
'C:\Users\LOCALA~1\AppData\Local\Temp\pyrosm\new-york-latest.osm.pbf'
Parsing roads lasted 161.0 seconds.
Number of roads parsed: 625779

Okay, so it took 2.7 minutes to parse around 626,000 drivable roads from OSM and create a GeoDataFrame from the data, not bad! (OSMnx was still running after 3 hours, at which point the test was stopped without results.)

And this is what the data looks like on a map (plotting done separately using QGIS):

New York State roads

Let’s do the same test for buildings:

from pyrosm import OSM, get_data
import time

# Initialize
fp = get_data("new_york")
osm = OSM(fp)

# Parse buildings and time it
start_time = time.time()
buildings = osm.get_buildings()
print(f"Parsing buildings lasted {round(time.time() - start_time, 0)} seconds.")
print(f"Number of buildings parsed: {len(buildings)}")
Parsing buildings lasted 214.0 seconds.
Number of buildings parsed: 2231758

Okay, and as we can see, parsing around 2.2 million buildings into a GeoDataFrame from the same area took around 3.6 minutes. (Trying to do the same thing with OSMnx ended with a memory error after an hour.)

And this is what the data looks like on a map:

New York State buildings

And a close-up to New York City:

New York City buildings

As we can see, parsing large datasets from OSM Protobuf files into a GeoDataFrame is very easy and fast with Pyrosm. The available physical memory of the computer is the most significant limitation when it comes to parsing very large datasets. With 16 GB of RAM, it should be possible to read OSM data from Protobuf files up to about 250 MB in size fairly easily. The most memory-consuming part currently is constructing Shapely geometries for the GeoDataFrame. There might be improvements coming once Geopandas starts to support Pygeos geometry arrays.
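Since memory is the limiting factor, it can be useful to measure how much a parsing step allocates before trying a larger extract. The sketch below uses the standard library's `tracemalloc` module; `measure_peak_memory` and `build_rows` are hypothetical names, and the workload is a stand-in for a call such as `osm.get_buildings()`. Note the assumption involved: `tracemalloc` only sees allocations made through Python's memory allocators, so memory allocated directly by C extensions may not be counted and the figure is best read as a lower bound.

```python
import tracemalloc


def measure_peak_memory(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MB).

    Hypothetical helper; only allocations visible to tracemalloc are counted.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / 1e6


def build_rows(n):
    # Stand-in workload; replace with e.g. osm.get_buildings
    return [(i, float(i)) for i in range(n)]


rows, peak_mb = measure_peak_memory(build_rows, 100_000)
print(f"Peak traced memory: {peak_mb:.1f} MB")
```

Running the same measurement on a small extract first gives a rough idea of whether a larger file is likely to fit in the available RAM.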