In the preceding post, GeoJSON from GeoPDF, we did some quick work to prove that we could get the data we need. Now we need to dig a little deeper and put it into a format that we can use.

All GeoPDF Content to JSON

Create JSON files containing all the data retrieved with the GDAL library with the script below.

"""Create a JSON of the same base filename of every PDF in the directory, recursively, in-place"""

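import os
import subprocess
from glob import glob

filenames = glob("**/*.pdf", recursive=True)
for pdf in filenames:
    # Swap only the extension; replacing every "pdf" would also rewrite folder names
    json_path = os.path.splitext(pdf)[0] + ".json"
    # gdalinfo prints its report as JSON on stdout when given -json
    result = subprocess.run(["gdalinfo", pdf, "-json"], stdout=subprocess.PIPE)
    out = result.stdout.decode("utf-8")
    with open(json_path, 'w') as outfile:
        outfile.write(out)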
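
Deriving the output name from the PDF name is easy to get subtly wrong, because a plain substring replacement of "pdf" also rewrites any directory name that contains it. Below is a minimal sketch of a suffix-only swap using `pathlib`. The helper name `json_path_for` and the sample path are hypothetical and not part of the original script.

```python
from pathlib import Path


def json_path_for(pdf_path):
    """Return the sibling .json path for a PDF path (hypothetical helper)."""
    # with_suffix only changes the final extension, never the directories
    return Path(pdf_path).with_suffix(".json")


if __name__ == "__main__":
    # The "pdfs" folder name survives; only the extension changes
    print(json_path_for("pdfs/trail-map.pdf"))
```
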

Parse & Reform Useful Parts

Convert the raw JSON files into parsed JSON files containing only the content that I want with the script below.

# Process Maps with Extents.py
from glob import glob
import json

filenames = glob("**/*.json", recursive=True)
failures = []
for json_filename in filenames:
    # Skip files that an earlier run of this script already produced
    if json_filename.find("parsed") != -1:
        continue
    with open(json_filename, 'r') as file:
        try:
            data = json.load(file)
            corners = data["wgs84Extent"]["coordinates"][0]
            lats = [lat for (_, lat) in corners[:-1]] # 5th point repeats first so slice
            lngs = [lng for (lng, _) in corners[:-1]] # 5th point repeats first so slice

            lat_avg = sum(lats) / float(len(lats))
            lng_avg = sum(lngs) / float(len(lngs))

            parsed = {}
            parsed["center"] = {
                "latitude": lat_avg,
                "longitude": lng_avg
            }
            # Name each corner by where it sits relative to the center
            for (lng, lat) in corners:
                if lng < lng_avg and lat > lat_avg:
                    parsed["northwest"] = {
                        "latitude": lat,
                        "longitude": lng
                    }
                elif lng < lng_avg and lat < lat_avg:
                    parsed["southwest"] = {
                        "latitude": lat,
                        "longitude": lng
                    }
                elif lng > lng_avg and lat > lat_avg:
                    parsed["northeast"] = {
                        "latitude": lat,
                        "longitude": lng
                    }
                elif lng > lng_avg and lat < lat_avg:
                    parsed["southeast"] = {
                        "latitude": lat,
                        "longitude": lng
                    }
            # parsed["rotationsCW"] = 0
            new_fn = json_filename.replace(".json","-parsed.json")
            content = json.dumps(parsed, indent=4)
            with open(new_fn, 'w') as new_file:
                new_file.write(content)
        except (KeyError, IndexError, TypeError, ValueError):
            # No usable wgs84Extent (or invalid JSON): record the file for manual review
            failures.append(json_filename)

with open("parsing failed.txt", 'w') as output:
    output.writelines("%s\n" % l for l in failures)
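
The core of the parsing step can be pulled out into one function and tried on known values. The sketch below averages the four unique corners to find the center, then names each corner by which side of the center it falls on. The function name `parse_extent` is hypothetical. The ring is assembled from the corner values in the example output at the end of this post, and the corner order is an assumption that the function does not depend on.

```python
def parse_extent(corners):
    """Reduce a closed ring of [lng, lat] pairs to a center and named corners."""
    ring = corners[:-1]  # the last point repeats the first
    lat_avg = sum(lat for _, lat in ring) / len(ring)
    lng_avg = sum(lng for lng, _ in ring) / len(ring)

    parsed = {"center": {"latitude": lat_avg, "longitude": lng_avg}}
    for lng, lat in ring:
        # A point exactly on a center line cannot be named, so skip it
        if lat == lat_avg or lng == lng_avg:
            continue
        name = ("north" if lat > lat_avg else "south") + \
               ("west" if lng < lng_avg else "east")
        parsed[name] = {"latitude": lat, "longitude": lng}
    return parsed


# Corner values copied from the example parsed file shown in this post
ring = [
    [-106.3666026, 40.430488],
    [-106.359167, 40.0597988],
    [-105.7158853, 40.0655602],
    [-105.7198027, 40.436325],
    [-106.3666026, 40.430488],
]
result = parse_extent(ring)
print(sorted(result))
```

A map whose extent is rotated close to 45 degrees could put two corners in the same quadrant, in which case one name would be overwritten and another would be missing. Checking that all four names are present is a cheap guard.
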

List Missing Rectangles

Some files don’t give up their contents like the others, and they will need to be processed manually later. We can get a list of the files with problems by running the script below.

# ListMissingLatLonExtent.py

from glob import glob

missing = []
filenames = glob("**/*.json", recursive=True)
for json_filename in filenames:
    with open(json_filename, 'r') as file:
        text = file.read()
        # Use `and` here: the bitwise `&` binds tighter than `==` and changes the test
        if text.find("wgs84Extent") == -1 and text.find("testing") == -1:
            missing.append(json_filename)

with open("missing rectangle.txt", 'w') as output:
    output.writelines("%s\n" % l for l in missing)
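
A substring search also matches when "wgs84Extent" appears anywhere in the text, for example inside a metadata string. As an alternative sketch, assuming the raw files are meant to be JSON objects, the check can look for the key after parsing. The helper name `has_extent` is hypothetical.

```python
import json


def has_extent(text):
    """Return True if the JSON text has a non-empty top-level wgs84Extent."""
    try:
        data = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    if not isinstance(data, dict):
        return False
    extent = data.get("wgs84Extent")
    # Require a dict with a non-empty "coordinates" entry
    return isinstance(extent, dict) and bool(extent.get("coordinates"))
```

Files that are empty or not valid JSON count as missing here, which is usually what you want for a manual review list.
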

Final Usable Output

The final JSON that I will use in the app looks like the example below, stelprdb5339604-parsed.json, whose name matches the raw filename of the GeoPDF. I have added a rotation property because some raw PDFs are not oriented with north at the top. I would rather not modify the source file every time it is updated, so I will track the rotation in the JSON file.

{
    "center": {
        "latitude": 40.248042999999996,
        "longitude": -106.0403644
    },
    "northwest": {
        "latitude": 40.430488,
        "longitude": -106.3666026
    },
    "southwest": {
        "latitude": 40.0597988,
        "longitude": -106.359167
    },
    "southeast": {
        "latitude": 40.0655602,
        "longitude": -105.7158853
    },
    "northeast": {
        "latitude": 40.436325,
        "longitude": -105.7198027
    },
    "rotationsCW": 0
}