Free IP address geolocation with MaxMind and Snowflake

With a Java UDF in Snowflake that geolocates IP addresses using MaxMind’s GeoLite2 Free Geolocation Data — including support for IPv4 and IPv6. No pre-processing needed ❄️🐻‍❄️📍

--

Image generated by AI: “Polar bear placing IP addresses on a map”

Some backstory

At my previous job I published “Geolocation with BigQuery: De-identify 76 million IP addresses in 20 seconds”. There I described how I transformed MaxMind’s GeoLite2 Free Geolocation database into a SQL table, ready to produce fast results thanks to a couple of “clever” SQL tricks.

The bad news: Recently Google deleted these tables that we had shared back then, so people have been asking me for help reinstating them. Since I don’t work at Google anymore, I can’t do much about this.

The good news: I work with Snowflake now! So let’s take this opportunity to revisit this use case, and let’s build a better solution thanks to some cool Snowflake abilities. We’ll even support IPv6 this time.

Alternative solutions

Before delving into this post, check whether IPinfo in the Snowflake Marketplace already has what you need.

In the post “Marketplace Monetization — Self service access to Premium datasets” I described how IPinfo shares data and functions in Snowflake to make IP geolocation easy and optimized for performance. They keep these datasets updated, accurate, and available with self-service access in Snowflake, at a convenient price.

(check the video I posted too)

Getting MaxMind’s GeoLite2 Free Geolocation Data in Snowflake

With all that said, let’s write a Java UDF in Snowflake using MaxMind’s assets.

Why a Java UDF

As explained above, my previous solution involved some clever tricks and transformations with SQL to get this done. But now I can run Java inside Snowflake, including reading the proprietary mmdb files from MaxMind — and all works really well. Thanks, Snowpark.

Step 1: Get the needed files from MaxMind

JARs: You can get the Java libraries straight from MaxMind’s GitHub project (Apache 2.0 licensed, at https://github.com/maxmind/GeoIP2-java/releases). The latest version right now is geoip2-4.0.0-with-dependencies.zip.

Data: Getting the free GeoLite2 database is a little harder, as you will need to create an account with them and accept their terms of service. It’s “free”, but before using this at your workplace you’ll need to check their terms of use — and maybe opt in to their paid offerings instead. Their site is not straightforward to navigate once you sign up, but you’ll eventually be able to download these files.
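If you prefer scripting the download, MaxMind documents permalink URLs that work with a license key from your account. A minimal sketch (the edition_id and license_key parameters follow their documentation at the time of writing; verify the current format in your account portal):

# Replace YOUR_LICENSE_KEY with the key generated in your MaxMind account
curl -o GeoLite2-Country.tar.gz 'https://download.maxmind.com/app/geoip_download?edition_id=GeoLite2-Country&license_key=YOUR_LICENSE_KEY&suffix=tar.gz'
curl -o GeoLite2-City.tar.gz 'https://download.maxmind.com/app/geoip_download?edition_id=GeoLite2-City&license_key=YOUR_LICENSE_KEY&suffix=tar.gz'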

For this post I got GeoLite2-Country_20221209.tar.gz and GeoLite2-City_20221209.tar.gz. To avoid any licensing issues, I’m not going to share any actual lookup results.

Step 2: Repackage and upload to Snowflake

Decompress the files:

unzip geoip2-4.0.0-with-dependencies.zip
tar -xzf GeoLite2-City_20221209.tar.gz
tar -xzf GeoLite2-Country_20221209.tar.gz

Upload to a Snowflake stage — with SnowSQL:

use role sysadmin;
use schema fh_db.public;
create stage fh_jars;
put 'file://geoip2-4.0.0/lib/*.jar' @fh_jars auto_compress=false overwrite=true;
put 'file://GeoLite2-City_20221209/GeoLite2-City.mmdb' @fh_jars auto_compress=false overwrite=true;
put 'file://GeoLite2-Country_20221209/GeoLite2-Country.mmdb' @fh_jars auto_compress=false overwrite=true;
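Optionally, check that the JARs and mmdb files landed in the stage:

list @fh_jars;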

Step 3: Write the Java UDF

This Java UDF will give you the geolocated country for any IP address:

create or replace function geoip2_country(ip string)
returns string
language java
handler = 'X.x'
imports = ('@fh_jars/geoip2-4.0.0.jar'
, '@fh_jars/maxmind-db-3.0.0.jar'
, '@fh_jars/jackson-annotations-2.14.1.jar'
, '@fh_jars/jackson-core-2.14.1.jar'
, '@fh_jars/jackson-databind-2.14.1.jar')
as $$
import java.net.InetAddress;

import com.snowflake.snowpark_java.types.SnowflakeFile;
import com.maxmind.geoip2.model.*;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.exception.AddressNotFoundException;

class X {
    DatabaseReader _reader;

    public String x(String ip) throws Exception {
        if (null == _reader) {
            // Lazy initialization: parse the mmdb file only once per UDF instance
            _reader = new DatabaseReader.Builder(
                SnowflakeFile.newInstance("@fh_jars/GeoLite2-Country.mmdb").getInputStream()).build();
        }
        try {
            return _reader.country(InetAddress.getByName(ip)).getCountry().getIsoCode();
        } catch (AddressNotFoundException e) {
            return null; // valid IP, but not in the database
        }
    }
}
$$;

If you want geolocation at the city level, read GeoLite2-City.mmdb instead, and a simple rewrite of the main try block would be:

CityResponse r = _reader.city(InetAddress.getByName(ip));
return r.getCity().getName() + ", " + r.getMostSpecificSubdivision().getIsoCode() + ", " + r.getCountry().getIsoCode();
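Putting it all together, here’s a sketch of a full city-level UDF, assuming the same stage and JARs as above (only the mmdb file and the lookup call change):

create or replace function geoip2_city(ip string)
returns string
language java
handler = 'X.x'
imports = ('@fh_jars/geoip2-4.0.0.jar'
, '@fh_jars/maxmind-db-3.0.0.jar'
, '@fh_jars/jackson-annotations-2.14.1.jar'
, '@fh_jars/jackson-core-2.14.1.jar'
, '@fh_jars/jackson-databind-2.14.1.jar')
as $$
import java.net.InetAddress;

import com.snowflake.snowpark_java.types.SnowflakeFile;
import com.maxmind.geoip2.model.CityResponse;
import com.maxmind.geoip2.DatabaseReader;
import com.maxmind.geoip2.exception.AddressNotFoundException;

class X {
    DatabaseReader _reader;

    public String x(String ip) throws Exception {
        if (null == _reader) {
            // Lazy initialization: parse the mmdb file only once per UDF instance
            _reader = new DatabaseReader.Builder(
                SnowflakeFile.newInstance("@fh_jars/GeoLite2-City.mmdb").getInputStream()).build();
        }
        try {
            CityResponse r = _reader.city(InetAddress.getByName(ip));
            return r.getCity().getName() + ", " + r.getMostSpecificSubdivision().getIsoCode() + ", " + r.getCountry().getIsoCode();
        } catch (AddressNotFoundException e) {
            return null; // valid IP, but not in the database
        }
    }
}
$$;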

Notes on this UDF:

  • Easy Java: Snowflake can import these JAR files just by listing them in imports when defining the UDF.
  • Easy binaries: Snowflake can easily process unstructured data, like the mmdb database files we are using here. The SnowflakeFile class helps us find these files and exposes them as a standard InputStream.
  • Easy scaling: Snowflake parallelizes the operation, creating many instances of the UDF class.
  • Improving performance with laziness: Note the lazy initialization of _reader — that way the UDF can be called many times and the class instance reused, without re-parsing the mmdb file on every call.
  • Caveat: Don’t try to move the lazy initialization to the constructor of the class. At that stage you might get the error "java.io.IOException: Remote file access feature is not supported for file ...". This is an issue being investigated around the new SnowflakeFile class, which lets you access arbitrary files without specifying them when declaring the UDF.
  • Caveat 2: I was able to read the mmdb file within the constructor with IMPORTS — but my queries performed better overall with lazy initialization anyway.
  • Get a full variant object as a reply: Just return a JSON string using r.toJson(), as sketched below.
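For example, with a hypothetical geoip2_city_json variant whose body ends in return r.toJson(), you could extract fields on the SQL side. The JSON path below follows GeoIP2’s serialized layout (an assumption worth verifying against your own output):

-- geoip2_city_json is a hypothetical variant of geoip2_city that returns r.toJson()
select parse_json(geoip2_city_json('156.33.241.5')):country:iso_code::string;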

Step 4: Ready to test

A couple of simple tests:

select geoip2_country('156.33.241.5');
-- US

select geoip2_city('156.33.241.5');
-- Washington, DC, US

select geoip2_country('2620:0:1cfe:face:b00c::3');
-- US

Yay! It works — and it even supports IPv6 addresses.
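As an aside you can verify yourself: private-range addresses like 10.0.0.1 are not in the GeoLite2 database, so the AddressNotFoundException handler should make the UDF return NULL rather than fail:

select geoip2_country('10.0.0.1');
-- NULL expected: private ranges are not in the database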

TO-DO:

  • The UDF could handle nulls better when dealing with city-level lookups.
  • The UDF could catch badly formatted IP addresses instead of throwing an exception (see the sketch after this list).
  • Instead of X.x(), pick a better class name, and merge the two UDFs into one Java class that you import — instead of defining the UDF inline.
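For the second TO-DO item, a minimal sketch: InetAddress.getByName() throws UnknownHostException for strings it can’t parse as an address (for non-literal input it may also attempt a DNS lookup), so catching it lets the UDF return null instead of failing the whole query:

try {
    return _reader.country(InetAddress.getByName(ip)).getCountry().getIsoCode();
} catch (AddressNotFoundException e) {
    return null; // valid IP, but not in the database
} catch (java.net.UnknownHostException e) {
    return null; // badly formatted IP address
}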

Step 5: Performance

To steer clear of licensing issues I won’t use real IP addresses this time, and I’ll share synthetic results instead. Still, let’s see how fast our UDF is.

First, let’s generate 76 million random IP addresses:

create or replace table random_ips as 
select uniform(1, 255, random()) || '.' || uniform(1, 255, random()) || '.' || uniform(1, 255, random()) || '.' || uniform(1, 255, random()) ip
from table(generator(rowcount => 76000000));
-- 26s, XL

Let’s measure how long it takes Snowflake to geolocate these 76 million IPs:

select geoip2_country(ip), count(*)
from random_ips
group by 1
order by 2 desc
limit 100;
-- 20s XL
-- 9s 3XL

Fabricated results — geolocating 76 million random IP addresses

This is fast, and it works. Note that the same experiment at the city level is about 3 times slower, as it takes longer to go through the larger mmdb file.

Acknowledgements

I wrote this thanks to the push from users of my previous solution who are looking for a replacement, the questions asked by Stack Overflow user “Kaido”, and my teammate Gokhan Atil, who helped them with a Snowflake answer:

Note how Gokhan’s answer uses a more standard way of reading the mmdb file, without the SnowflakeFile class.


Next steps

  • You can compare the performance, accuracy, and availability of MaxMind’s offering with IPinfo’s.
  • Personal note: My favorite is IPinfo, as it’s already available in the Snowflake Marketplace with a free trial — and they are constantly keeping it updated and improving its performance.
  • Keep playing with Snowflake’s abilities to run Java and Python, and even parse unstructured data like the mmdb files here.

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.
