Most developers are familiar with fundamental data structures like arrays, linked lists, and hash tables. However, many advanced data structures exist that can optimize performance and solve complex problems more efficiently. Understanding these data structures can make a significant difference in writing optimized code and handling large-scale data efficiently. In this article, we will explore Tries, Segment trees, Skip lists, and Bloom filters.
1. Tries: fast lookups for dictionary-like data
Use case: Tries (prefix trees) are particularly useful for applications involving search auto-completion, spell checking, and IP routing.
How they work
A trie is a tree-like data structure where each node represents a character of a string. Strings with common prefixes share the same branches, making lookup operations efficient.
Operations and complexity
- Insertion: O(n), where n is the length of the string
- Search: O(n)
- Memory usage: Can be high due to pointer overhead
Example Use Case: Implementing an Auto-Complete Feature
Before diving into the implementation, let's first understand the problem we aim to solve. An auto-complete feature suggests words or phrases based on user input, improving user experience in search bars, text editors, and messaging applications. This requires efficient lookup and retrieval of words that share common prefixes.
To achieve this, we use a Trie.
Insertion
Conceptually, insertion works by traversing the tree character by character:
- Start from the root node.
- For each character in the word:
- if the character does not exist in the current node’s children, create a new node
- move to the child node corresponding to the character
- After processing all characters, mark the last node as the end of a word.
Search
Searching follows a similar traversal process:
- Start from the root node.
- For each character in the word:
- if the character does not exist in the current node’s children, return
False
- move to the child node corresponding to the character
- if the character does not exist in the current node’s children, return
- If all characters are found and the last node is marked as a word ending, return
True
; otherwise, returnFalse
.
Below is an implementation in Python:
class TrieNode:
def __init__(self):
self.children = {}
self.is_end_of_word = False
class Trie:
def __init__(self):
self.root = TrieNode()
def insert(self, word):
node = self.root
for char in word:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
node.is_end_of_word = True
def search(self, word):
node = self.root
for char in word:
if char not in node.children:
return False
node = node.children[char]
return node.is_end_of_word
2. Segment trees: Efficient range queries
Use case: Segment trees are ideal for answering range queries quickly, such as finding the sum or minimum value in a given range.
How they work
A segment tree is a binary tree where each node represents an interval. The root represents the entire range, while the leaves represent individual elements. Updates and queries are performed efficiently through tree traversal.
Operations and Complexity
- Build tree: O(n)
- Range query: O(log n)
- Update operation: O(log n)
Example use case: Range sum query
Given an array, the problem here consists of computing the sum of all values in a specified range. In addition to being a common problem in competitive programming, range sum queries (RSQ) have a wide array of practical applications across various domains. These include data analytics, where RSQ can efficiently calculate cumulative statistics like moving averages; financial systems, where it helps track balances or transactions over time; and image processing, where it allows for quick computation of pixel sums over regions of an image. They are also useful in fields like game development, geospatial data analysis, and IoT systems, where aggregating values over specific ranges or time intervals is often required.
We will use a segment tree.
First, let's initialize the tree. We will implement the tree as an array.
class SegmentTree:
def __init__(self, nums):
self.n = len(nums)
self.tree = [0] * (2 * self.n)
for i in range(self.n):
self.tree[self.n + i] = nums[i]
for i in range(self.n - 1, 0, -1):
self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
nums
: This is the initial input array (e.g., a list of integers).n
: Stores the size of the input arraynums
.tree
: Initializes an array (or list) of size2*n
to store the segment tree. The segment tree is built in a bottom-up manner:- the first
n
elements (fromn
to2*n - 1
) will store the elements of the input arraynums
- the remaining elements (from index
1
ton - 1
) will store the sum of the segments (ranges) of the array.
- the first
Buiding the segment tree
The first for
loop fills in the leaf nodes of the tree (starting from index n
to 2*n - 1
).
The second for
loop goes from the last non-leaf node (n - 1
) upwards to the root (index 1
) and calculates the sum for each internal node based on the sum of its two child nodes.
Updating the array
def update(self, index, value):
pos = self.n + index
self.tree[pos] = value
while pos > 1:
pos //= 2
self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
index
: The index of the element in the original array that needs to be updated.value
: The new value that the element at the specified index should have.
Updating the Tree
The element at index index
in the original array is updated. The position in the tree is calculated as n + index
.
The value at that position in the tree is updated to the new value
.
After updating the leaf, the method continues upwards to update the sum values of the internal nodes. At each step, the node value is updated as the sum of its two children.
The query method
Now, let's dive in the crux of the problem : computing the sum of items in a specified range using an already built segment tree.
def query(self, left, right):
l, r = self.n + left, self.n + right
res = 0
while l < r:
if l % 2:
res += self.tree[l]
l += 1
if r % 2:
r -= 1
res += self.tree[r]
l //= 2
r //= 2
return res
left
andright
: The range for which we want to compute the sum (inclusive of left and exclusive of right).
Querying the Tree
l
and r
are the positions corresponding to left
and right
in the tree (adjusted by adding n
).
We traverse the tree and sum the elements in the given range:
- If
l
is an odd index, it is a right child, so we addtree[l]
to the result and incrementl
to move to the next index. - If
r
is an odd index, it is a left child, so we addtree[r]
to the result and decrementr
to move to the previous index.
After each step, both l
and r
are halved to move upwards to the parent nodes.
The loop continues until the range is fully processed, and the result is returned.
Here is the full SegmentTree
class.
class SegmentTree:
def __init__(self, nums):
self.n = len(nums)
self.tree = [0] * (2 * self.n)
for i in range(self.n):
self.tree[self.n + i] = nums[i]
for i in range(self.n - 1, 0, -1):
self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
def update(self, index, value):
pos = self.n + index
self.tree[pos] = value
while pos > 1:
pos //= 2
self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
def query(self, left, right):
l, r = self.n + left, self.n + right
res = 0
while l < r:
if l % 2:
res += self.tree[l]
l += 1
if r % 2:
r -= 1
res += self.tree[r]
l //= 2
r //= 2
return res
3. Skip lists: Probabilistic alternative to balanced trees
Use case: Skip lists offer an alternative to balanced trees for maintaining sorted data with efficient insert, delete, and search operations. They are faster than simple linked lists and at the same time easier the implement than balanced trees.
How they work
Skip lists use multiple layers of linked lists where higher levels "skip" over elements, allowing faster searches. The probability of an element appearing at a higher level follows a geometric distribution.
Operations and Complexity
- Insertion: O(log n)
- Search: O(log n)
- Deletion: O(log n)
Example use case: Ordered data storage with fast lookups
The problem being here is the efficient storage and management of ordered data, enabling fast search, insertion, and deletion operations. In traditional data structures like linked lists or arrays, these operations can be slow, especially as the data grows. Skip lists solve this by organizing the data across multiple levels, allowing quick navigation through the list and reducing the time complexity of search and updates to O(logn).
First, let's create a node for the skip list.
class Node:
def __init__(self, value, level):
self.value = value
self.forward = [None] * (level + 1)
value
: The value stored in the nodeforward
: A list that contains "pointers" (or references) to the next nodes at different levels. The length of this list is determined by the level of the node (i.e., how many "layers" or "links" the node will have). The higher the level, the more "express lanes" the node has for faster traversal.
The level
is the number of layers the node will occupy. A node at level 0 will have only one link, while a node at level k
will have k+1
links (or pointers) at different levels.
Now the SkipList
class.
class SkipList:
def __init__(self, max_level=16):
self.head = Node(-1, max_level)
self.max_level = max_level
self.level = 0
head
: The "sentinel" or starting node of the skip list, which doesn’t store any meaningful data (value = -1
). This node spans all possible levels up tomax_level
.max_level
: The maximum possible level in the skip list.level
: The current highest level used in the skip list.
The random_level
method
For a new node, we generate a random level at which the node will be based on a probabilistic approach. On average, about 50% of the nodes will have a higher level than the previous one, meaning that the level distribution follows a geometric distribution.
import random
def random_level(self):
lvl = 0
while random.random() < 0.5 and lvl < self.max_level:
lvl += 1
return lvl
Insertion
Fist, we have to find the position to insert.
update = [None] * (self.max_level + 1)
node = self.head
for i in range(self.level, -1, -1):
while node.forward[i] and node.forward[i].value < value:
node = node.forward[i]
update[i] = node
The list update
keeps track of the nodes at each level where we need to update the "forward" pointers.
We start from the highest level (level
) and traverse each level down to level 0, searching for where the new node should be inserted.
We move through the list by comparing values, similar to how you might traverse a sorted list.
The update[i]
array stores the nodes at each level just before where the new node will be inserted.
Next, we determine the level of the new node.
lvl = self.random_level()
if lvl > self.level:
for i in range(self.level + 1, lvl + 1):
update[i] = self.head
self.level = lvl
The random_level()
function is called to decide the level at which the new node will appear.
If the new node's level is greater than the current skip list's highest level (level
), the update
array is adjusted to point to the head node for the higher levels, and the list's level is updated.
Finally, we create and insert the new node.
new_node = Node(value, lvl)
for i in range(lvl + 1):
new_node.forward[i] = update[i].forward[i]
update[i].forward[i] = new_node
A new node is created with the chosen level.
The forward
pointers of the new node are set to the corresponding nodes in the update
list.
Finally, the update
list is used to update the forward
pointers of the nodes at each level, effectively inserting the new node at the correct position.
Here is the whole implementation:
import random
class Node:
def __init__(self, value, level):
self.value = value
self.forward = [None] * (level + 1)
class SkipList:
def __init__(self, max_level=16):
self.head = Node(-1, max_level)
self.max_level = max_level
self.level = 0
def random_level(self):
lvl = 0
while random.random() < 0.5 and lvl < self.max_level:
lvl += 1
return lvl
def insert(self, value):
update = [None] * (self.max_level + 1)
node = self.head
for i in range(self.level, -1, -1):
while node.forward[i] and node.forward[i].value < value:
node = node.forward[i]
update[i] = node
lvl = self.random_level()
if lvl > self.level:
for i in range(self.level + 1, lvl + 1):
update[i] = self.head
self.level = lvl
new_node = Node(value, lvl)
for i in range(lvl + 1):
new_node.forward[i] = update[i].forward[i]
update[i].forward[i] = new_node
Key Concepts to remember for a skip list:
- Ordered data: The list is sorted as elements are inserted in order.
- Multiple levels: The skip list has multiple layers to enable fast traversal (similar to binary search). The higher levels allow skipping over large sections of the list, reducing the search time.
- Probabilistic balancing: The levels of the nodes are determined probabilistically (via random_level), so the skip list achieves good average performance without requiring expensive balancing operations (unlike balanced trees).
4. Bloom filters: Space-efficient membership queries
Use case: Bloom filters are useful when approximate membership queries are acceptable, such as detecting spam emails or caching.
Some real world applications of bloom filters :
- Google Chrome uses Bloom filters in its ‘Safe Browsing’ feature. The Bloom filter helps quickly check if a URL is potentially harmful (such as known phishing or malware sites) without sending the actual URL to Google, thus enhancing privacy and speed.
- Medium uses Bloom filters in its Recommendation module to avoid showing those posts that have already been seen by the user.
- Cassandra/HBase uses bloom filters to optimize the search of data in an SSTable on the disk.
- Akamai, a global content delivery networkAuthor Of article : Axel N'cho Read full article