Tracking unique page visits or unique users is a common requirement for business applications. Doing this exactly at large volumes is difficult because every distinct member has to be stored, so the memory requirements are high. The HyperLogLog data structure solves this problem in a small, fixed amount of memory, although it only provides an approximation. This approximation is usually good enough in practice. In this article, we will learn how to use HyperLogLog in Redis with Python.
For setting up Redis, I would recommend using a managed service in production. Azure, for example, has a great Redis service that scales easily. However, you will still want to learn Redis and, eventually, how to scale it yourself. This will help with debugging cloud services or, eventually, with saving money by not using them.
We will start our intro to Redis using Docker Compose. Create a docker-compose.yml file and add the following.
version: "3.2"
services:
  redis:
    image: "redis:alpine"
    command: redis-server
    ports:
      - "6379:6379"
    volumes:
      - $PWD/redis-data:/var/lib/redis
      - $PWD/redis.conf:/usr/local/etc/redis/redis.conf
    environment:
      - REDIS_REPLICATION_MODE=master
Ensure you have Docker installed, then run:
docker-compose up
In Python, the most commonly used Redis module is called redis-py, and it can be installed as follows.
pip install redis
Let's open up a new file, index.py, and go through the common commands you will use with HyperLogLog in Redis.
We can add items to a HyperLogLog data type using the pfadd function. We specify the name of the key, in this case "users", and then pass in one or more members.
# PF Add
r.pfadd("users", "user1", "user2")
For each of the examples below, I will use the following template to run all the commands. Here is my full index file. We will just replace the commands each time.
import redis
r = redis.Redis(host='localhost', port=6379, db=0)
r.pfadd("users", "user1", "user2")
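To tie pfadd back to the page-visit use case, here is a minimal sketch of recording visitors per page and per day. The helper names visit_key and record_visit and the "visits:<page>:<date>" key scheme are my own assumptions for illustration; only pfadd itself comes from redis-py. The client argument is expected to be a redis.Redis instance like r in the template.

```python
from datetime import date


def visit_key(page, day):
    """Build a key such as 'visits:/pricing:2024-05-01'.

    The naming scheme is hypothetical; use whatever fits your application.
    """
    return f"visits:{page}:{day.isoformat()}"


def record_visit(client, page, user_id, day=None):
    """Add one visitor to the HyperLogLog for this page and day.

    `client` is assumed to be a redis.Redis instance. PFADD returns 1 when
    the HyperLogLog was altered by the call and 0 otherwise, so the result
    usually (not always, since the structure is approximate) tells us
    whether this visitor was new for the day.
    """
    day = day or date.today()
    changed = client.pfadd(visit_key(page, day), user_id)
    return bool(changed)
```

With the template above, calling record_visit(r, "/pricing", "user1") would add user1 to today's HyperLogLog for that page. Keeping one key per day keeps each counter small and makes it easy to combine days later.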
Once we have a HyperLogLog set, we can get the count. HyperLogLog uses an approximation to cope with high volumes, so the count won't always be exact. To get this count, pfcount is the method to use.
# PF Count
result = r.pfcount("users")
print(result) # 2
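To see why the count is an estimate rather than an exact number, here is a small pure-Python sketch of the HyperLogLog idea. It is an illustration only and not Redis's implementation: the class name TinyHyperLogLog, the SHA-1 hash, and the simplified small-range correction are assumptions made for this example. The 14-bit precision mirrors the 16384 registers that Redis documents, along with a standard error of about 0.81%.

```python
import hashlib
import math


class TinyHyperLogLog:
    """Illustrative HyperLogLog sketch (hypothetical, not Redis's code)."""

    def __init__(self, precision=14):
        self.p = precision
        self.m = 1 << precision        # number of registers (16384 for p=14)
        self.registers = [0] * self.m

    def add(self, member):
        # Take a 64-bit hash of the member.
        digest = hashlib.sha1(str(member).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        index = h >> width             # the first p bits pick a register
        rest = h & ((1 << width) - 1)  # the remaining bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        # Each register only remembers the largest rank it has seen,
        # which is why duplicates never change the result.
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        # Bias constant; this formula is valid for m >= 128.
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        # For small cardinalities, fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


hll = TinyHyperLogLog()
for i in range(20000):
    hll.add(f"user{i % 10000}")  # 10000 unique users, each seen twice
print(hll.count())               # close to 10000, but not guaranteed exact
```

The memory used depends only on the number of registers, not on how many members are added, which is the trade-off that makes the approach practical at high volume.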
The last HyperLogLog method is the merge command. We can use pfmerge, passing the destination key first and then the names of the HyperLogLog sets to combine. In the example below, you can see that we have two sets, and after they are merged, only the unique members are counted.
# PF Merge
r.pfadd("users-app1", "user1", "user2")
r.pfadd("users-app2", "user1", "user3")
r.pfmerge("users-new", "users-app1", "users-app2")
result = r.pfcount("users-new")
print(result) # 3
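If you only need the combined count and not a stored result, pfcount also accepts several keys and returns the estimate for their union. The sketch below shows both options side by side. The helper names union_count and merge_into are my own, and client is assumed to be a redis.Redis instance like r in the template.

```python
def union_count(client, *keys):
    """Estimate unique members across several HyperLogLog keys.

    PFCOUNT with multiple keys computes the union on the fly and does
    not store a merged key.
    """
    return client.pfcount(*keys)


def merge_into(client, dest_key, *source_keys):
    """Merge the source keys into dest_key and return its estimate.

    If dest_key already exists, Redis treats it as one of the sources,
    so its existing members are included in the result.
    """
    client.pfmerge(dest_key, *source_keys)
    return client.pfcount(dest_key)
```

With the two app keys from the example above, union_count(r, "users-app1", "users-app2") should give the same estimate as merging into "users-new", without creating the extra key. Storing the merged key is worthwhile when you will read the combined count often.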