Huffman Coding: Efficient Data Compression
By Jitendra M (@_JitendraM)
Table of contents:
- Introduction
- What is Huffman Coding?
- How Huffman Coding Works
- Time Complexity
- Space Complexity
- Code Implementation
- Pros and Cons
- Real-World Applications
- Comparison with Other Compression Algorithms
- Conclusion

Introduction
Huffman coding is a widely used method for lossless data compression. It’s a variable-length prefix coding algorithm that assigns shorter codes to more frequent symbols in a dataset, resulting in efficient data compression. This article explores the fundamentals of Huffman coding, how it works, its time complexity, pros and cons, real-world applications, and comparison with other compression algorithms.
What is Huffman Coding?
Huffman coding, created by David A. Huffman in 1952, is a variable-length prefix coding algorithm and a form of entropy encoding used for lossless data compression. It’s particularly effective for compressing text and multimedia files. The algorithm builds a binary tree from symbol frequencies and reads prefix-free codes off that tree, so shorter codes represent more common characters or sequences. It’s used in various data compression formats, including DEFLATE, which underlies ZIP files and PNG images.
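To make the prefix-free property concrete, here is a minimal Python sketch that checks that no code in a code table is a prefix of another; the table shown is the one derived later in this article’s ABRACADABRA example. This property is what lets a decoder split a bit stream unambiguously without separators.

# Code table from the ABRACADABRA example later in this article
codes = {'A': '0', 'R': '10', 'B': '110', 'C': '1110', 'D': '1111'}

def is_prefix_free(code_table):
    values = list(code_table.values())
    # No code may be a prefix of any other code
    return not any(a != b and b.startswith(a) for a in values for b in values)

print(is_prefix_free(codes))  # True: the bit stream can be decoded without separators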
How Huffman Coding Works
Huffman coding works by constructing a binary tree, known as a Huffman tree, based on the frequency of characters in the input data. Here’s how it works (a small trace of these steps on a sample string follows the list):
- Frequency Count: Calculate the frequency of each character or symbol in the input data.
- Create Nodes: Create a leaf node for each character or symbol, with a weight equal to its frequency.
- Build Huffman Tree: Repeatedly combine the two nodes with the lowest weights into a new internal node whose weight is their sum, until only one node remains – the root of the Huffman tree.
- Assign Codes: Traverse the tree to assign a unique binary code to each character. Left branches represent binary 0, and right branches represent binary 1.
- Encode Data: Replace each character in the input data with its corresponding Huffman code.
- Decoding: Walk the Huffman tree according to the encoded bit stream, emitting a character and returning to the root each time a leaf is reached.
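As referenced above, the following minimal Python sketch (using the standard heapq module) traces the merge steps on a sample string. Nodes are simplified to (weight, symbols) tuples for readability; this is an illustration of the build step, not the tree-node implementation used later in this article.

import heapq
from collections import Counter

def trace_merges(data):
    # Each heap entry is (weight, symbols): a simplified stand-in for tree nodes
    heap = [(freq, sym) for sym, freq in Counter(data).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, s1 = heapq.heappop(heap)  # lowest weight
        w2, s2 = heapq.heappop(heap)  # second-lowest weight
        print(f"merge {s1}({w1}) + {s2}({w2}) -> ({w1 + w2})")
        heapq.heappush(heap, (w1 + w2, s1 + s2))

trace_merges("ABRACADABRA")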
Time Complexity
Counting symbol frequencies takes O(n) time, where n is the length of the input. Building the Huffman tree then takes O(k log k), where k is the number of distinct symbols, since each of the k − 1 merges performs priority-queue operations that cost O(log k); for the ABRACADABRA example, k = 5, so only four merges are needed. Encoding and decoding are both linear in the size of the data they process (each bit of the encoded stream corresponds to one step down the tree), making Huffman coding an efficient compression method.
Space Complexity
Huffman coding requires space to store the frequency table and the Huffman tree. Both are proportional to the number of distinct symbols k, so the auxiliary space complexity is O(k): the tree contains k leaves and k − 1 internal nodes, plus one frequency entry per symbol.
Code Implementation
Pseudo Code
Here is a simple pseudo code representation of the Huffman coding algorithm:
function buildHuffmanTree(data):
    freq_map = count_frequencies(data)
    priority_queue = create_priority_queue(freq_map)  # min-heap ordered by frequency
    while priority_queue.size() > 1:
        node1 = priority_queue.pop()                  # lowest frequency
        node2 = priority_queue.pop()                  # second-lowest frequency
        merged_node = merge_nodes(node1, node2)       # weight = node1.freq + node2.freq
        priority_queue.push(merged_node)
    return priority_queue.pop()                       # the root of the Huffman tree
Example
Let’s consider a simple example to illustrate Huffman coding. Suppose we want to encode the following string: “ABRACADABRA.”
- Frequency Count:
  - A: 5
  - B: 2
  - R: 2
  - C: 1
  - D: 1
The Huffman tree is constructed as follows (internal nodes are labeled with their combined weights):

        (11)
       /    \
    A(5)    (6)
           /    \
        R(2)    (4)
               /    \
            B(2)    (2)
                   /    \
                C(1)    D(1)
The binary codes assigned are:
- A: 0
- R: 10
- B: 110
- C: 1110
- D: 1111
Now, ABRACADABRA is encoded as 01101001110011110110100 (23 bits, compared with 88 bits for the 11-character string in 8-bit ASCII).
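As a quick sanity check, here is a minimal Python sketch that re-encodes the example with the code table above and confirms the bit count; it is a standalone verification, not the full implementation given in the next section.

codes = {'A': '0', 'R': '10', 'B': '110', 'C': '1110', 'D': '1111'}
text = "ABRACADABRA"

# Concatenate the code for every character in the input
encoded = ''.join(codes[ch] for ch in text)
print(encoded)               # 01101001110011110110100
print(len(encoded), "bits")  # 23 bits, versus 88 bits at 8 bits per character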
Python Implementation
import heapq
from collections import Counter

class HuffmanNode:
    def __init__(self, char, freq):
        self.char = char        # None for internal nodes
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):
        # Needed so heapq can order nodes by frequency
        return self.freq < other.freq

def build_huffman_tree(data):
    frequency = Counter(data)
    heap = [HuffmanNode(char, freq) for char, freq in frequency.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lowest-frequency nodes under a new internal node
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        parent = HuffmanNode(None, left.freq + right.freq)
        parent.left, parent.right = left, right
        heapq.heappush(heap, parent)
    return heap[0]

def build_huffman_codes(root, current_code, huffman_codes):
    if root is None:
        return
    if root.char is not None:           # leaf node: record its code
        huffman_codes[root.char] = current_code
        return
    build_huffman_codes(root.left, current_code + '0', huffman_codes)
    build_huffman_codes(root.right, current_code + '1', huffman_codes)

def huffman_encoding(data):
    root = build_huffman_tree(data)
    huffman_codes = {}
    build_huffman_codes(root, '', huffman_codes)
    encoded_data = ''.join(huffman_codes[char] for char in data)
    return encoded_data, root

def huffman_decoding(encoded_data, root):
    decoded_data = []
    current_node = root
    for bit in encoded_data:
        current_node = current_node.left if bit == '0' else current_node.right
        if current_node.left is None and current_node.right is None:
            decoded_data.append(current_node.char)
            current_node = root          # restart from the root for the next symbol
    return ''.join(decoded_data)

# Example usage
if __name__ == "__main__":
    data = "this is an example for huffman encoding"
    encoded_data, tree = huffman_encoding(data)
    print(f"Encoded data: {encoded_data}")
    decoded_data = huffman_decoding(encoded_data, tree)
    print(f"Decoded data: {decoded_data}")
C++ Implementation
#include <iostream>
#include <queue>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class HuffmanNode {
public:
    char data;          // '\0' placeholder for internal nodes
    int freq;
    HuffmanNode* left;
    HuffmanNode* right;

    HuffmanNode(char data, int freq)
        : data(data), freq(freq), left(nullptr), right(nullptr) {}

    // Recursively frees the whole subtree
    ~HuffmanNode() {
        delete left;
        delete right;
    }
};

// Orders the priority queue so the lowest frequency is on top (min-heap)
class CompareHuffmanNodes {
public:
    bool operator()(HuffmanNode* a, HuffmanNode* b) {
        return a->freq > b->freq;
    }
};

HuffmanNode* buildHuffmanTree(const string& data) {
    unordered_map<char, int> freqMap;
    for (char c : data) {
        freqMap[c]++;
    }
    priority_queue<HuffmanNode*, vector<HuffmanNode*>, CompareHuffmanNodes> minHeap;
    for (auto& pair : freqMap) {
        minHeap.push(new HuffmanNode(pair.first, pair.second));
    }
    // Repeatedly merge the two lowest-frequency nodes until one root remains
    while (minHeap.size() > 1) {
        HuffmanNode* left = minHeap.top();
        minHeap.pop();
        HuffmanNode* right = minHeap.top();
        minHeap.pop();
        HuffmanNode* parent = new HuffmanNode('\0', left->freq + right->freq);
        parent->left = left;
        parent->right = right;
        minHeap.push(parent);
    }
    return minHeap.top();
}

void buildHuffmanCodes(HuffmanNode* root, string code,
                       unordered_map<char, string>& huffmanCodes) {
    if (!root) {
        return;
    }
    if (!root->left && !root->right) {   // leaf node: record its code
        huffmanCodes[root->data] = code;
        return;
    }
    buildHuffmanCodes(root->left, code + "0", huffmanCodes);
    buildHuffmanCodes(root->right, code + "1", huffmanCodes);
}

string huffmanEncoding(const string& data, HuffmanNode* root) {
    unordered_map<char, string> huffmanCodes;
    buildHuffmanCodes(root, "", huffmanCodes);
    string encodedData;
    for (char c : data) {
        encodedData += huffmanCodes[c];
    }
    return encodedData;
}

string huffmanDecoding(const string& encodedData, HuffmanNode* root) {
    string decodedData;
    HuffmanNode* currentNode = root;
    for (char bit : encodedData) {
        currentNode = (bit == '0') ? currentNode->left : currentNode->right;
        if (!currentNode->left && !currentNode->right) {   // reached a leaf
            decodedData += currentNode->data;
            currentNode = root;                            // restart for the next symbol
        }
    }
    return decodedData;
}

int main() {
    string data = "this is an example for huffman encoding";
    HuffmanNode* root = buildHuffmanTree(data);   // build the tree once and reuse it
    string encodedData = huffmanEncoding(data, root);
    cout << "Encoded data: " << encodedData << endl;
    string decodedData = huffmanDecoding(encodedData, root);
    cout << "Decoded data: " << decodedData << endl;
    delete root;                                  // the destructor frees the whole tree
    return 0;
}
Pros and Cons
Pros
- Produces an optimal prefix code for a given set of symbol frequencies, so compression is efficient.
- Simple to implement and widely used.
- Fast encoding and decoding, both linear in the size of the data.
Cons
- Not suited to all data: it gains little when symbol frequencies are nearly uniform, and it codes one symbol at a time, so it cannot exploit repeated sequences.
- The static encoding process requires knowledge of the entire input up front to count frequencies, and the code table or tree must be stored alongside the compressed data for decoding.
Real-World Applications
Huffman coding is used in various applications and compression formats, including:
- Text and general-purpose compression via DEFLATE, used in ZIP archives.
- Image compression in formats like PNG, which also relies on DEFLATE.
- Multimedia compression in MP3 audio and MPEG video.
Comparison with Other Compression Algorithms
Huffman coding is known for its simplicity and efficiency, but it’s not the only compression algorithm. Lempel-Ziv-Welch (LZW) is a dictionary-based method that exploits repeated substrings, and the Burrows-Wheeler Transform (BWT) rearranges data so that an entropy coder can compress it more effectively. Compared with these, Huffman coding excels in simplicity and speed, and it is often used as the entropy-coding stage of larger schemes such as DEFLATE; on its own, however, its compression ratio may not always be the best.
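For a rough, illustrative comparison, here is a small Python sketch that contrasts the total bit count of an optimal Huffman code (computed by summing the weights of the internal nodes created during the merges) with the output of Python’s built-in zlib module, which implements DEFLATE (LZ77 plus Huffman coding). The sample text and repeat factor are arbitrary choices, zlib output includes header overhead, and the Huffman figure ignores the cost of storing the code table, so this is a sketch rather than a rigorous benchmark.

import heapq
import zlib
from collections import Counter

def huffman_bits(data):
    # Total encoded length in bits for an optimal prefix code:
    # the sum of the weights of all internal nodes created during the merges
    heap = list(Counter(data).values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total

text = "this is an example for huffman encoding " * 20   # repetitive sample input
print("raw:    ", len(text) * 8, "bits")
print("huffman:", huffman_bits(text), "bits (symbol codes only)")
print("zlib:   ", len(zlib.compress(text.encode())) * 8, "bits (DEFLATE)")

On repetitive input like this, DEFLATE typically wins because its LZ77 stage removes repeated substrings before Huffman coding is applied.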
Conclusion
Huffman coding is a fundamental technique in data compression. It efficiently reduces the size of data by assigning shorter codes to more frequent symbols. While it may not always achieve the highest compression ratios, its simplicity and speed make it a popular choice in various applications.