Python Redis HyperLogLog Commands

09.05.2021

Intro

Tracking unique page visits or unique users is a common requirement for business applications. Doing this exactly at large volumes is difficult, because every distinct value has to be stored and memory grows with the number of visitors. The HyperLogLog data structure solves this problem with a small, fixed amount of memory, although it only provides an approximation. This approximation is usually good enough in practice. In this article, we will learn how to use HyperLogLog in Redis with Python.

Setting up Redis

For production, I would recommend using a managed Redis service. Azure, for example, has a great Redis service that scales easily. However, you will still want to learn Redis and, eventually, how to scale it yourself. This helps with debugging cloud services and may eventually let you save money by running Redis yourself.

We will start with Redis using Docker Compose. Create a docker-compose.yml file and add the following.

version: "3.2"
services:
  redis:
    image: "redis:alpine"
    command: redis-server
    ports:
      - "6379:6379"
    volumes:
      # The official redis image stores its data in /data
      - $PWD/redis-data:/data

Ensure you have Docker installed, then run:

docker-compose up

Installing Redis Modules

In Python, the most widely used Redis client is redis-py, which can be installed as follows.

pip install redis

Writing the Code

Let's open up a new file, index.py, and go through the common commands you will use with HyperLogLog in Redis.

Adding to a HyperLogLog

We can add items to a HyperLogLog using the pfadd method. We specify the name of the key, in this case "users", followed by the members to add. If the key does not exist yet, Redis creates it.

# PF Add
r.pfadd("users", "user1", "user2")

For each of the examples below, I will use the following template to run the commands. Here is my full index.py file; we will just replace the commands each time.

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

r.pfadd("users", "user1", "user2")

HyperLogLog Count

Once we have a HyperLogLog, we can get its count. HyperLogLog approximates the count so that it can handle high volumes, so the result won't always be exact; the Redis documentation states a standard error of 0.81%. To get the count, use the pfcount method.

# PF Count
result = r.pfcount("users")
print(result) # 2
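
To get a feel for why the count is an estimate, here is a minimal pure-Python sketch of the HyperLogLog idea. It is a simplified illustration and not how Redis implements it internally: the class name ToyHyperLogLog, the SHA-1 based hash, and the default precision are all assumptions made for this example. Each item is hashed, the first bits pick a register, and each register remembers the longest run of leading zeros it has seen.

```python
import hashlib
import math


class ToyHyperLogLog:
    """Simplified HyperLogLog for illustration only (not Redis's implementation)."""

    def __init__(self, precision=14):
        # precision should be at least 7 for the alpha formula below to apply
        self.p = precision                # hash bits used to pick a register
        self.m = 1 << precision           # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item (truncated SHA-1, an arbitrary choice here)
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        index = h >> width                # top p bits select the register
        rest = h & ((1 << width) - 1)     # remaining bits
        # 1-based position of the first 1-bit in the remaining bits
        rank = width - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        # Harmonic-mean estimate over all registers
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting
        if raw <= 2.5 * self.m and zeros:
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


hll = ToyHyperLogLog()
for i in range(10000):
    hll.add(f"user{i}")
    hll.add(f"user{i}")  # duplicates never change the registers

print(hll.count())  # close to 10000, but usually not exactly 10000
```

Memory use depends only on the number of registers, not on how many items were added, which is what makes the approach attractive for large volumes.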

HyperLogLog Merge

The last HyperLogLog command is merge. We can use pfmerge and pass in the destination key followed by the names of the HyperLogLogs to combine. In the example below, we have two HyperLogLogs, and after merging, only the unique members are counted.

# PF Merge
r.pfadd("users-app1", "user1", "user2")
r.pfadd("users-app2", "user1", "user3")

r.pfmerge("users-new", "users-app1", "users-app2")

result = r.pfcount("users-new")
print(result) # 3
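
Putting the three commands together, here is a rough sketch of how daily unique visitors could be tracked. The helper names (visits_key, record_visit, unique_visitors, rollup) and the "visits:page:day" key layout are my own assumptions for illustration; only pfadd, pfcount and pfmerge come from redis-py. Note that pfcount accepts several keys and returns the approximate size of their union, so a merge is only needed when you want to store the combined result.

```python
def visits_key(page, day):
    """Build the key for one page and one day (hypothetical naming scheme)."""
    return f"visits:{page}:{day}"


def record_visit(client, page, day, user_id):
    """Add a visitor to that day's HyperLogLog (PFADD)."""
    return client.pfadd(visits_key(page, day), user_id)


def unique_visitors(client, page, days):
    """Approximate unique visitors across several days.

    PFCOUNT with multiple keys counts their union without storing a new key.
    """
    keys = [visits_key(page, day) for day in days]
    return client.pfcount(*keys)


def rollup(client, page, days, dest):
    """Store the union of several days under dest (PFMERGE) and return its count."""
    keys = [visits_key(page, day) for day in days]
    client.pfmerge(dest, *keys)
    return client.pfcount(dest)


# Usage with the template from earlier (requires a running Redis server):
#
#   import redis
#   r = redis.Redis(host='localhost', port=6379, db=0)
#   record_visit(r, "home", "2021-05-08", "user1")
#   record_visit(r, "home", "2021-05-09", "user1")
#   record_visit(r, "home", "2021-05-09", "user3")
#   print(unique_visitors(r, "home", ["2021-05-08", "2021-05-09"]))
```

Keeping one key per day lets you answer both "how many today" and "how many this week" from the same data, and old days can be expired or rolled up as needed.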