Most developers are familiar with fundamental data structures like arrays, linked lists, and hash tables. However, many advanced data structures exist that can optimize performance and solve complex problems more efficiently. Understanding these data structures can make a significant difference in writing optimized code and handling large-scale data efficiently. In this article, we will explore Tries, Segment trees, Skip lists, and Bloom filters.

1. Tries: fast lookups for dictionary-like data

Use case: Tries (prefix trees) are particularly useful for applications involving search auto-completion, spell checking, and IP routing.

How they work

A trie is a tree-like data structure where each node represents a character of a string. Strings with common prefixes share the same branches, making lookup operations efficient.

Operations and complexity

  • Insertion: O(n), where n is the length of the string
  • Search: O(n)
  • Memory usage: Can be high due to pointer overhead

Example Use Case: Implementing an Auto-Complete Feature

Before diving into the implementation, let's first understand the problem we aim to solve. An auto-complete feature suggests words or phrases based on user input, improving user experience in search bars, text editors, and messaging applications. This requires efficient lookup and retrieval of words that share common prefixes.

To achieve this, we use a Trie.

Insertion

Conceptually, insertion works by traversing the tree character by character:

  1. Start from the root node.
  2. For each character in the word:
    • if the character does not exist in the current node’s children, create a new node
    • move to the child node corresponding to the character
  3. After processing all characters, mark the last node as the end of a word.

Search

Searching follows a similar traversal process:

  1. Start from the root node.
  2. For each character in the word:
    • if the character does not exist in the current node’s children, return False
    • move to the child node corresponding to the character
  3. If all characters are found and the last node is marked as a word ending, return True; otherwise, return False.

Below is an implementation in Python:

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end_of_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.is_end_of_word = True

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.is_end_of_word

2. Segment trees: Efficient range queries

Use case: Segment trees are ideal for answering range queries quickly, such as finding the sum or minimum value in a given range.

How they work

A segment tree is a binary tree where each node represents an interval. The root represents the entire range, while the leaves represent individual elements. Updates and queries are performed efficiently through tree traversal.

Operations and Complexity

  • Build tree: O(n)
  • Range query: O(log n)
  • Update operation: O(log n)

Example use case: Range sum query

Given an array, the problem here consists of computing the sum of all values in a specified range. In addition to being a common problem in competitive programming, range sum queries (RSQ) have a wide array of practical applications across various domains. These include data analytics, where RSQ can efficiently calculate cumulative statistics like moving averages; financial systems, where it helps track balances or transactions over time; and image processing, where it allows for quick computation of pixel sums over regions of an image. They are also useful in fields like game development, geospatial data analysis, and IoT systems, where aggregating values over specific ranges or time intervals is often required.

We will use a segment tree.

First, let's initialize the tree. We will implement the tree as an array.

class SegmentTree:
    def __init__(self, nums):
        self.n = len(nums)
        self.tree = [0] * (2 * self.n)
        for i in range(self.n):
            self.tree[self.n + i] = nums[i]
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
  • nums: This is the initial input array (e.g., a list of integers).
  • n: Stores the size of the input array nums.
  • tree: Initializes an array (or list) of size 2*n to store the segment tree. The segment tree is built in a bottom-up manner:
    • the first n elements (from n to 2*n - 1) will store the elements of the input array nums
    • the remaining elements (from index 1 to n - 1) will store the sum of the segments (ranges) of the array.

Buiding the segment tree

The first for loop fills in the leaf nodes of the tree (starting from index n to 2*n - 1).

The second for loop goes from the last non-leaf node (n - 1) upwards to the root (index 1) and calculates the sum for each internal node based on the sum of its two child nodes.

Updating the array

def update(self, index, value):
    pos = self.n + index
    self.tree[pos] = value
    while pos > 1:
        pos //= 2
        self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
  • index: The index of the element in the original array that needs to be updated.
  • value: The new value that the element at the specified index should have.

Updating the Tree

The element at index indexin the original array is updated. The position in the tree is calculated as n + index.

The value at that position in the tree is updated to the new value.

After updating the leaf, the method continues upwards to update the sum values of the internal nodes. At each step, the node value is updated as the sum of its two children.

The query method

Now, let's dive in the crux of the problem : computing the sum of items in a specified range using an already built segment tree.

def query(self, left, right):
    l, r = self.n + left, self.n + right
    res = 0
    while l < r:
        if l % 2:
            res += self.tree[l]
            l += 1
        if r % 2:
            r -= 1
            res += self.tree[r]
        l //= 2
        r //= 2
    return res
  • left and right: The range for which we want to compute the sum (inclusive of left and exclusive of right).

Querying the Tree

l and r are the positions corresponding to left and right in the tree (adjusted by adding n).

We traverse the tree and sum the elements in the given range:

  • If l is an odd index, it is a right child, so we add tree[l] to the result and increment l to move to the next index.
  • If r is an odd index, it is a left child, so we add tree[r] to the result and decrement r to move to the previous index.

After each step, both l and r are halved to move upwards to the parent nodes.

The loop continues until the range is fully processed, and the result is returned.

Here is the full SegmentTree class.

class SegmentTree:
    def __init__(self, nums):
        self.n = len(nums)
        self.tree = [0] * (2 * self.n)
        for i in range(self.n):
            self.tree[self.n + i] = nums[i]
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, index, value):
        pos = self.n + index
        self.tree[pos] = value
        while pos > 1:
            pos //= 2
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]

    def query(self, left, right):
        l, r = self.n + left, self.n + right
        res = 0
        while l < r:
            if l % 2:
                res += self.tree[l]
                l += 1
            if r % 2:
                r -= 1
                res += self.tree[r]
            l //= 2
            r //= 2
        return res

3. Skip lists: Probabilistic alternative to balanced trees

Use case: Skip lists offer an alternative to balanced trees for maintaining sorted data with efficient insert, delete, and search operations. They are faster than simple linked lists and at the same time easier the implement than balanced trees.

How they work

Skip lists use multiple layers of linked lists where higher levels "skip" over elements, allowing faster searches. The probability of an element appearing at a higher level follows a geometric distribution.

Operations and Complexity

  • Insertion: O(log n)
  • Search: O(log n)
  • Deletion: O(log n)

Example use case: Ordered data storage with fast lookups

The problem being here is the efficient storage and management of ordered data, enabling fast search, insertion, and deletion operations. In traditional data structures like linked lists or arrays, these operations can be slow, especially as the data grows. Skip lists solve this by organizing the data across multiple levels, allowing quick navigation through the list and reducing the time complexity of search and updates to O(logn).

First, let's create a node for the skip list.

class Node:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * (level + 1)
  • value: The value stored in the node
  • forward: A list that contains "pointers" (or references) to the next nodes at different levels. The length of this list is determined by the level of the node (i.e., how many "layers" or "links" the node will have). The higher the level, the more "express lanes" the node has for faster traversal.

The levelis the number of layers the node will occupy. A node at level 0 will have only one link, while a node at level k will have k+1 links (or pointers) at different levels.

Now the SkipList class.

class SkipList:
    def __init__(self, max_level=16):
        self.head = Node(-1, max_level)
        self.max_level = max_level
        self.level = 0
  • head: The "sentinel" or starting node of the skip list, which doesn’t store any meaningful data (value = -1). This node spans all possible levels up to max_level.
  • max_level: The maximum possible level in the skip list.
  • level: The current highest level used in the skip list.

The random_levelmethod

For a new node, we generate a random level at which the node will be based on a probabilistic approach. On average, about 50% of the nodes will have a higher level than the previous one, meaning that the level distribution follows a geometric distribution.

import random

def random_level(self):
    lvl = 0
    while random.random() < 0.5 and lvl < self.max_level:
        lvl += 1
    return lvl

Insertion

Fist, we have to find the position to insert.

update = [None] * (self.max_level + 1)
node = self.head
for i in range(self.level, -1, -1):
    while node.forward[i] and node.forward[i].value < value:
        node = node.forward[i]
    update[i] = node

The list update keeps track of the nodes at each level where we need to update the "forward" pointers.
We start from the highest level (level) and traverse each level down to level 0, searching for where the new node should be inserted.
We move through the list by comparing values, similar to how you might traverse a sorted list.
The update[i] array stores the nodes at each level just before where the new node will be inserted.

Next, we determine the level of the new node.

lvl = self.random_level()
if lvl > self.level:
    for i in range(self.level + 1, lvl + 1):
        update[i] = self.head
    self.level = lvl

The random_level() function is called to decide the level at which the new node will appear.
If the new node's level is greater than the current skip list's highest level (level), the update array is adjusted to point to the head node for the higher levels, and the list's level is updated.

Finally, we create and insert the new node.

new_node = Node(value, lvl)
for i in range(lvl + 1):
    new_node.forward[i] = update[i].forward[i]
    update[i].forward[i] = new_node

A new node is created with the chosen level.
The forward pointers of the new node are set to the corresponding nodes in the update list.
Finally, the update list is used to update the forward pointers of the nodes at each level, effectively inserting the new node at the correct position.

Here is the whole implementation:

import random

class Node:
    def __init__(self, value, level):
        self.value = value
        self.forward = [None] * (level + 1)

class SkipList:
    def __init__(self, max_level=16):
        self.head = Node(-1, max_level)
        self.max_level = max_level
        self.level = 0

    def random_level(self):
        lvl = 0
        while random.random() < 0.5 and lvl < self.max_level:
            lvl += 1
        return lvl

    def insert(self, value):
        update = [None] * (self.max_level + 1)
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].value < value:
                node = node.forward[i]
            update[i] = node
        lvl = self.random_level()
        if lvl > self.level:
            for i in range(self.level + 1, lvl + 1):
                update[i] = self.head
            self.level = lvl
        new_node = Node(value, lvl)
        for i in range(lvl + 1):
            new_node.forward[i] = update[i].forward[i]
            update[i].forward[i] = new_node

Key Concepts to remember for a skip list:

  • Ordered data: The list is sorted as elements are inserted in order.
  • Multiple levels: The skip list has multiple layers to enable fast traversal (similar to binary search). The higher levels allow skipping over large sections of the list, reducing the search time.
  • Probabilistic balancing: The levels of the nodes are determined probabilistically (via random_level), so the skip list achieves good average performance without requiring expensive balancing operations (unlike balanced trees).

4. Bloom filters: Space-efficient membership queries

Use case: Bloom filters are useful when approximate membership queries are acceptable, such as detecting spam emails or caching.

Some real world applications of bloom filters :

  • Google Chrome uses Bloom filters in its ‘Safe Browsing’ feature. The Bloom filter helps quickly check if a URL is potentially harmful (such as known phishing or malware sites) without sending the actual URL to Google, thus enhancing privacy and speed.
  • Medium uses Bloom filters in its Recommendation module to avoid showing those posts that have already been seen by the user.
  • Cassandra/HBase uses bloom filters to optimize the search of data in an SSTable on the disk.
  • Akamai, a global content delivery networkAuthor Of article : Axel N'cho Read full article