Data Structures
I am actually thinking of collapsing all of the nested files / folders that each represent a data structure into this one huge document. This way I can see everything all at once laid out in front of me and make use of the toggling feature of this site along with the floating table of contents and margin-notes!
The inspiration is my computer vision notes.
at a glance:
| Data Structure | Access (avg) | Search (avg) | Insertion (avg) | Deletion (avg) | Access (worst) | Search (worst) | Insertion (worst) | Deletion (worst) | Space (worst) |
|---|---|---|---|---|---|---|---|---|---|
| Array | Θ(1) | Θ(n) | Θ(n) | Θ(n) | O(1) | O(n) | O(n) | O(n) | O(n) |
| Stack | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Queue | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Singly-Linked List | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Doubly-Linked List | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Hash Table | N/A | Θ(1) | Θ(1) | Θ(1) | N/A | O(n) | O(n) | O(n) | O(n) |
| Binary Search Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(n) | O(n) | O(n) | O(n) | O(n) |
| B-Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |
| Red-Black Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |
| AVL Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |
primitives
schematic/s
python
| type | example |
|---|---|
| `int` | `32` |
| `float` | `3.2` |
| `complex` | `(3+2j)` |
| `bool` | `False` |
| `str` | `"Hello World!"` |
| `bytes` | `b'\x00\x00\x00'` |
| `NoneType` | `None` |
java
| type | size (bits) | description | usage |
|---|---|---|---|
| `byte` | 8 | -128 to 127 | `byte b = 100;` |
| `short` | 16 | -32,768 to 32,767 | `short s = 1000;` |
| `int` | 32 | $-2^{31}$ to $2^{31}-1$ | `int i = 12345;` |
| `long` | 64 | $-2^{63}$ to $2^{63}-1$ | `long l = 123456789L;` |
| `float` | 32 | single-precision floating-point number | `float f = 3.14f;` |
| `double` | 64 | double-precision floating-point number | `double d = 3.14159;` |
| `char` | 16 | single 16-bit unicode character | `char c = 'A';` |
| `boolean` | 1 bit or 1 byte ⊕ | `true` or `false` | `boolean flag = true;` |
c
| type | size (bits) | description | usage |
|---|---|---|---|
| `char` | typically 8 | character type (smallest addressable unit) | `char c = 'A';` |
| `short` | typically 16 | short integer | `short s = 1000;` |
| `int` | typically 32 | integer number | `int i = 42;` |
| `long` | typically 32 or 64 | long integer | `long l = 123456L;` |
| `long long` | typically $\geq$ 64 | extended-precision integer | `long long ll = 123456789LL;` |
| `float` | typically 32 | single-precision floating-point number | `float f = 3.14f;` |
| `double` | typically 64 | double-precision floating-point number | `double d = 3.14159;` |
| `long double` | platform-dependent (commonly 64, 80, or 128) | extended-precision floating-point number | `long double ld = 3.141592653589793L;` |
| `_Bool` | typically 8 | boolean value (since C99; `bool`/`true`/`false` via `<stdbool.h>`) | `_Bool flag = 1;` or `bool flag = true;` |
| `void` | — | represents “no value” or “no type” | `void functionName();` |
arrays
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(1)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
| Worst | . | . | . | . |
explanations are warranted for these. access will be taken to mean the time complexity required to reach the $k$th item in the data structure.
to "access" the $k$th item we can index directly into the array: constant time.
searching for a particular key is not something we can do intelligently in this contiguous block of memory (absent extra structure, such as the array being sorted), so we must check all $n$ items.
insertion would on average take $\dfrac{n}{2}$ time, but because we are working with asymptotics, the constant disappears. the worst case is insertion at the front of the array, with every subsequent item having to be moved into the $(k+1)$th position.
deletion follows a similar argument, with the worst case being deletion of the first element, and the average case decaying to the worst-case bound asymptotically.
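a quick illustration of those costs with Python's built-in list (values are arbitrary, purely for illustration):

```python
xs = [3, 1, 4, 1, 5]

xs[2]            # access: direct indexing, Θ(1)
5 in xs          # search: linear scan over up to n items, Θ(n)
xs.insert(0, 9)  # insert at the front: every element shifts right, the O(n) worst case
xs.append(2)     # insert at the end: no shifting, amortised O(1)
del xs[0]        # delete the first element: every element shifts left, the O(n) worst case
```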
space complexity
this is less of a question when you have a data structure, as opposed to an algorithm, because with $n$ elements you will have to store all of them uniquely. as such for arrays and all data structures on this page ⊕ we have $\mathcal{O}(n)$ space complexity.
operations
naturally, whilst we are considering access, search, insertion and deletion operations on this page, an array is most commonly represented in Python as a list. these lists have the following methods:

| method | explanation |
|---|---|
| `append(arg)` | adds element `arg` to the end of the list |
| `clear()` | removes all the elements from the list |
| `copy()` | returns a (shallow) copy of the list |
| `count(arg)` | returns the number of elements with value `arg` |
| `extend(another_list)` | adds the elements of `another_list` to this list |
| `index(arg)` | returns the index of the first element with value `arg` |
| `insert(i, arg)` | adds element `arg` at position `i` |
| `pop(i)` | removes and returns the element at position `i` (defaults to the last element) |
| `remove(arg)` | removes the first item with value `arg` |
| `reverse()` | reverses the order of the list in place |
| `sort()` | sorts the list in place. mutates the list |
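a quick pass over those methods (results shown in the comments):

```python
nums = [3, 1, 4]

nums.append(1)          # [3, 1, 4, 1]
nums.count(1)           # 2
nums.extend([5, 9])     # [3, 1, 4, 1, 5, 9]
nums.index(4)           # 2  (first occurrence)
nums.insert(0, 2)       # [2, 3, 1, 4, 1, 5, 9]
nums.pop(0)             # removes and returns 2
nums.remove(1)          # drops the first 1 -> [3, 4, 1, 5, 9]
nums.reverse()          # [9, 5, 1, 4, 3]
nums.sort()             # [1, 3, 4, 5, 9]  (in place)
copied = nums.copy()    # shallow copy
nums.clear()            # []
```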
linked lists
singly linked list
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | . | . | . | . |
once again, explanations are necessary:
- to access the $k$th item, we need to start at the `head` and make our way to this item; hence $\mathcal{O}(n)$ for access
- same for search; even if you know which item you want, there is no way of knowing where it is. you have to traverse from the `head`.
- the very act of insertion will take constant time. if you wish to find a "middle" item and then insert there, the complexity would be $\mathcal{O}(n)$, but here we are decoupling the operations
- same as above. simply freeing memory / deleting a node will take constant time.
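a bare-bones sketch of such a list (class and method names are my own, purely for illustration):

```python
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

class SinglyLinkedList:
    def __init__(self):
        self.head = None

    def push_front(self, value):
        # insertion at the head: O(1), no traversal needed
        self.head = Node(value, self.head)

    def pop_front(self):
        # deletion at the head: O(1)
        node = self.head
        self.head = node.next
        return node.value

    def search(self, value):
        # traversal from the head: O(n)
        current = self.head
        while current is not None:
            if current.value == value:
                return True
            current = current.next
        return False
```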
doubly linked lists
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | . | . | . | . |
with respect to the asymptotics of operations, the doubly linked list provides no advantage over the singly linked list.
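what the extra `prev` pointer does buy in practice is $\mathcal{O}(1)$ removal of a node you already hold a reference to; a rough sketch of my own, not tied to any particular library:

```python
class DNode:
    def __init__(self, value):
        self.value = value
        self.prev = None
        self.next = None

def unlink(node):
    # splice the node out in O(1); a singly linked list would first have to
    # walk from the head to find the predecessor
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None
```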
tradeoffs
non-contiguous use of memory is an advantage when allocating nodes (no single large block of free memory is required), but a disadvantage for traversal: every step follows a pointer to a potentially distant address, which is unfriendly to caches and rules out constant-time indexing.
strings
strings really are just special cases of arrays. but because their operations vary so wildly from the usual array operations:
- concatenation
- joining
- comparison
- splitting
- searching for substrings
it makes sense for this topic to have its own little nook.
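each of the operations listed above is built into Python's `str` (results shown in the comments):

```python
s = "Hello World!"

s + "!!"                       # concatenation     -> "Hello World!!!"
", ".join(["a", "b", "c"])     # joining           -> "a, b, c"
"abc" < "abd"                  # comparison        -> True (lexicographic)
s.split(" ")                   # splitting         -> ["Hello", "World!"]
"World" in s, s.find("World")  # substring search  -> (True, 6)
```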
this page is not about algorithms, and so there is nothing really novel to add here at the moment, but a number of clever algorithms will relate back to this heading.
stacks & queues
stack
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | . | . | . | . |
as expected:
- random access will take $n$ steps. presumably here we are using a linked list implementation though and so arbitrary accesses will always take (asymptotically) $n$ steps
- similarly search as in an array or linked list requires $n$ steps
- the advantage comes from inserting into the stack which always takes constant time
- and deletion costs constant time. this is the use-case for this data structure anyhow
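in Python a plain list already behaves as a stack, pushing and popping at the same end:

```python
stack = []

stack.append("a")   # push: amortised O(1)
stack.append("b")
stack.append("c")

stack[-1]           # peek at the top -> "c"
stack.pop()         # pop: O(1), returns "c" (last in, first out)
```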
queue
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | . | . | . | . |
this is the same as the stack:
- underlying implementation details would be identical, so access behaviour wouldn't change
- neither would search functionality
- only the location of insertion
- and location of deletion would change
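a plain Python list would make dequeuing from the front an $\mathcal{O}(n)$ shift, so the standard library's collections.deque is the usual backing structure:

```python
from collections import deque

queue = deque()

queue.append("job 1")   # enqueue at the back: O(1)
queue.append("job 2")

queue.popleft()         # dequeue from the front: O(1), returns "job 1" (first in, first out)
```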
double ended queue (deque)
this is an interesting data structure. at first I thought it was short for dequeue, but it is not. instead this structure is pronounced "deck", and is a list with 2 pointers:
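in Python, collections.deque provides exactly this behaviour, with constant-time operations at both ends:

```python
from collections import deque

d = deque([2, 3])

d.appendleft(1)   # O(1) at the front -> deque([1, 2, 3])
d.append(4)       # O(1) at the back  -> deque([1, 2, 3, 4])
d.popleft()       # returns 1
d.pop()           # returns 4
```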
hash tables / hashmaps
In Python, the built-in `dict` (a hash map) maintains insertion order (guaranteed since 3.7); hash maps in general make no such promise.
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | n/a | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | . | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
obviously the time-complexity depends on the data structure and the way in which it is implemented from an atomic operations point of view.
having said this, hash-tables are sort of a mystery to me at the moment. they have varying implementations and even then I need to study each to understand what the associated compute would look like for each method.
for now I have just copied down what was given at bigocheatsheet.com
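one common implementation is separate chaining: hash the key into one of a fixed number of buckets, and keep a small list per bucket for collisions. a rough sketch of my own (not how CPython's `dict` actually works):

```python
class ChainedHashTable:
    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash() maps the key to an integer; the modulo picks a bucket
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        # average O(1): hash, then scan one (hopefully short) bucket
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def search(self, key):
        # average O(1); degrades to O(n) if every key lands in the same bucket
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        # same argument as search
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return
        raise KeyError(key)
```

the Θ(1) average assumes the hash function spreads keys evenly over the buckets; the O(n) worst case is every key colliding into one bucket, at which point each operation is just a linear scan of that bucket's list.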
trees
schematic
Binary Search Tree
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
| Worst | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
naturally, this table warrants explanation.
I can reconcile the average complexity of search taking $\log n$ steps: each comparison descends one level of a roughly balanced tree, discarding about half of the remaining nodes.
the worst case is different because a plain Binary Search Tree is not balanced and insertion does no rebalancing: insert the keys in sorted order and every node ends up with a single child, so the tree degenerates into a chain of height $n$ and access, search, insertion and deletion all become $\mathcal{O}(n)$.
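a bare-bones sketch (my own names) that makes the "no rebalancing" point visible:

```python
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    # no rebalancing happens here -- inserting sorted keys builds a chain
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    # O(height): ~log n when the tree is balanced, n when it has degenerated
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None
```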
Red Black Trees
schematic
time complexity
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
| Worst | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
I've just copied these down from bigocheatsheet.com blindly to close that tab and be done with it for now.
AVL Trees
An AVL tree is a binary search tree that is height-balanced: for each node $x$, the heights of the left and right subtrees of $x$ differ by at most 1.
To implement such a tree, maintain an extra attribute $h$ in each node such that $x.h$ is the height of node $x$.
| | Access | Search | Insertion | Deletion |
|---|---|---|---|---|
| Average | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
| Worst | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
same thing as above, here. I've just copied the time-complexities down without understanding them.
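a sketch of the height bookkeeping described above, with my own names and without the rotations that actually restore balance:

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.h = 1            # the extra attribute: height of this node

def height(node):
    return node.h if node is not None else 0

def update_height(node):
    # called after any structural change below the node
    node.h = 1 + max(height(node.left), height(node.right))

def balance_factor(node):
    # AVL invariant: this stays in {-1, 0, 1} for every node;
    # insertion/deletion use rotations to restore it when it drifts to +-2
    return height(node.left) - height(node.right)
```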
I also worry about the additional 5 'data-structures':
- skip lists
- cartesian tree
- b-tree (this one makes sense)
- splay tree
- kd tree
and then the short few that I have found of my own accord:
- tries
- heaps
- priority queue (which a heap seems to be an implementation of)
heaps
schematic
graphs
schematic
representations
adjacency matrix, adjacency lists (i.e. an array of linked lists)
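a tiny sketch of the same undirected graph in both representations (vertex labels are arbitrary):

```python
# a 4-vertex undirected graph given as an edge list
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# adjacency list: vertex -> list of neighbours
adj_list = {v: [] for v in range(4)}
for u, v in edges:
    adj_list[u].append(v)
    adj_list[v].append(u)
# {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

# adjacency matrix: adj_matrix[u][v] == 1 iff there is an edge (u, v)
adj_matrix = [[0] * 4 for _ in range(4)]
for u, v in edges:
    adj_matrix[u][v] = adj_matrix[v][u] = 1
```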