Data Structures

I am actually thinking of collapsing all of the nested files / folders that each represent a data structure into this one huge document. This way I can see everything all at once laid out in front of me and make use of the toggling feature of this site along with the floating table of contents and margin-notes!

The inspiration is my computer vision notes.

at a glance:

| Data Structure | Access (avg) | Search (avg) | Insertion (avg) | Deletion (avg) | Access (worst) | Search (worst) | Insertion (worst) | Deletion (worst) | Space (worst) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Array | Θ(1) | Θ(n) | Θ(n) | Θ(n) | O(1) | O(n) | O(n) | O(n) | O(n) |
| Stack | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Queue | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Singly-Linked List | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Doubly-Linked List | Θ(n) | Θ(n) | Θ(1) | Θ(1) | O(n) | O(n) | O(1) | O(1) | O(n) |
| Hash Table | N/A | Θ(1) | Θ(1) | Θ(1) | N/A | O(n) | O(n) | O(n) | O(n) |
| Binary Search Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(n) | O(n) | O(n) | O(n) | O(n) |
| B-Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |
| Red-Black Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |
| AVL Tree | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | Θ(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(log(n)) | O(n) |

primitives

schematic/s

ieee 754 floating point representation
two's complement for a 4-bit value
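
a quick Python sketch of both ideas (the 4-bit width matches the schematic; the function names here are mine, not a standard API):

```python
import struct

def to_twos_complement(value: int, bits: int = 4) -> str:
    """Return the two's-complement bit pattern of `value` using `bits` bits."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError(f"{value} does not fit in {bits} bits")
    # masking with (2**bits - 1) wraps negatives around, e.g. -3 -> 16 - 3 = 13 -> 1101
    return format(value & ((1 << bits) - 1), f"0{bits}b")

def from_twos_complement(pattern: str) -> int:
    """Interpret a bit string as a signed two's-complement integer."""
    value = int(pattern, 2)
    # if the sign bit is set, subtract 2**bits to recover the negative value
    return value - (1 << len(pattern)) if pattern[0] == "1" else value

print(to_twos_complement(-3))        # 1101
print(from_twos_complement("1101"))  # -3

# IEEE 754: inspect the raw 32 bits of a single-precision float
print(format(struct.unpack("<I", struct.pack("<f", 3.2))[0], "032b"))
```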

python

| type | example |
| --- | --- |
| int | 32 |
| float | 3.2 |
| complex | (3+2j) |
| bool | False |
| str | "Hello World!" |
| bytes | b'\x00\x00\x00' |
| NoneType | None |

java

| type | size (bits) | description | usage |
| --- | --- | --- | --- |
| byte | 8 | -128 to 127 | byte b = 100; |
| short | 16 | -32,768 to 32,767 | short s = 1000; |
| int | 32 | $-2^{31}$ to $2^{31}-1$ | int i = 12345; |
| long | 64 | $-2^{63}$ to $2^{63}-1$ | long l = 123456789L; |
| float | 32 | single-precision floating-point number | float f = 3.14f; |
| double | 64 | double-precision floating-point number | double d = 3.14159; |
| char | 16 | single 16-bit unicode character | char c = 'A'; |
| boolean | 1 bit of information (typically stored as 1 byte) | true or false | boolean flag = true; |

c

| type | size (bits) | description | usage |
| --- | --- | --- | --- |
| char | typically 8 | character type (smallest addressable unit) | char c = 'A'; |
| short | typically 16 | short integer | short s = 1000; |
| int | typically 32 | integer number | int i = 42; |
| long | typically 32 or 64 | long integer | long l = 123456L; |
| long long | typically $\geq$ 64 | extended-precision integer | long long ll = 123456789LL; |
| float | typically 32 | single-precision floating-point number | float f = 3.14f; |
| double | typically 64 | double-precision floating-point number | double d = 3.14159; |
| long double | $\geq$ 80 (platform-dependent) | extended-precision floating-point number | long double ld = 3.141592653589793L; |
| _Bool | typically 8 | boolean value (true or false, since C99) | _Bool flag = 1; or bool flag = true; |
| void | N/A | represents "no value" or "no type" | void functionName(); |

arrays

schematic

time complexity

|  | Access | Search | Insertion | Deletion |
| --- | --- | --- | --- | --- |
| Average | $\mathcal{O}(1)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |
| Worst | $\mathcal{O}(1)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |

explanations are warranted for these. access will be taken to mean the time required to reach the $k$th item in the data structure.

to "access" the $k$th item we can index into the array: constant time

searching for a particular key is not something we can do intelligently in this contiguous block of memory, so we must check all $n$ items.

insertion would on average take $\dfrac{n}{2}$ steps, but because we are working with asymptotics the constant factor disappears. the worst case is insertion at the front of the array, where every existing item at position $k$ has to be shifted to position $k+1$.

deletion follows a similar argument, with the worst case being deletion of the first element, and the average case again matching the worst-case bound asymptotically.
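
a rough illustration of the shifting cost using Python's list (absolute timings will vary by machine; the point is only the relative gap between appending and inserting at the front):

```python
from timeit import timeit

n = 100_000
setup = f"data = list(range({n}))"

# appending touches only the end of the contiguous block: amortised O(1)
append_time = timeit("data.append(0)", setup=setup, number=1_000)

# inserting at index 0 shifts all existing elements one slot to the right: O(n)
prepend_time = timeit("data.insert(0, 0)", setup=setup, number=1_000)

print(f"append x1000:    {append_time:.4f}s")
print(f"insert(0) x1000: {prepend_time:.4f}s  (noticeably slower)")
```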

space complexity

this is less of a question when you have a data structure, as opposed to an algorithm, because with $n$ elements you will have to store all of them uniquely. as such for arrays and all data structures on this page we have $\mathcal{O}(n)$ space complexity.

operations

naturally, whilst this page is concerned with access, search, insertion and deletion, the array is represented in Python as a list, and lists come with the following methods:

| method | explanation |
| --- | --- |
| append(x) | adds an element to the end of the list |
| clear() | removes all the elements from the list |
| copy() | returns a shallow copy of the list |
| count(x) | returns the number of elements equal to x |
| extend(another_list) | appends the elements of another list to this list |
| index(x) | returns the index of the first element equal to x |
| insert(i, x) | inserts an element at position i |
| pop(i) | removes and returns the element at position i (last element by default) |
| remove(x) | removes the first element equal to x |
| reverse() | reverses the order of the list in place |
| sort() | sorts the list in place (mutates the list) |
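
a few of these methods in action (values arbitrary):

```python
nums = [3, 1, 4, 1, 5]

nums.append(9)          # [3, 1, 4, 1, 5, 9]
nums.extend([2, 6])     # [3, 1, 4, 1, 5, 9, 2, 6]
nums.insert(0, 8)       # [8, 3, 1, 4, 1, 5, 9, 2, 6]
print(nums.count(1))    # 2
print(nums.index(4))    # 3
nums.remove(1)          # removes the first 1 -> [8, 3, 4, 1, 5, 9, 2, 6]
last = nums.pop()       # removes and returns 6
nums.sort()             # [1, 2, 3, 4, 5, 8, 9]
nums.reverse()          # [9, 8, 5, 4, 3, 2, 1]
print(nums)
```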

linked lists

singly linked list

schematic

time complexity

|  | Access | Search | Insertion | Deletion |
| --- | --- | --- | --- | --- |
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |

once again, explanations are necessary:

  • to access a $k$th item, we need to start at the head and make our way to this item; hence $\mathcal{O}(n)$ for access
  • same for search; even if you know which item you want, there is no way of knowing where it is. you have to traverse from the head.
  • the very act of insertion will take constant time. if you wish to find a "middle" item and then insert there, the complexity would be $\mathcal{O}(n)$, but here we are decoupling the operations
  • same as above. simply freeing memory / deleting a node will take constant time.
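
a minimal singly linked list sketch along these lines (the class and method names are mine, not a standard API):

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class SinglyLinkedList:
    """Toy singly linked list: nodes reachable only by walking from the head."""

    def __init__(self):
        self.head = None

    def push_front(self, value):
        """O(1): allocate a node and point it at the old head."""
        node = Node(value)
        node.next = self.head
        self.head = node

    def find(self, value):
        """O(n): traverse from the head until the value is found."""
        current = self.head
        while current is not None:
            if current.value == value:
                return current
            current = current.next
        return None

    def delete_after(self, node):
        """O(1): unlink the node that follows `node` (if any)."""
        if node.next is not None:
            node.next = node.next.next

lst = SinglyLinkedList()
for v in (3, 2, 1):
    lst.push_front(v)          # list is now 1 -> 2 -> 3
print(lst.find(2).value)       # 2
lst.delete_after(lst.head)     # removes 2: list is now 1 -> 3
```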

doubly linked lists

schematic

time complexity

|  | Access | Search | Insertion | Deletion |
| --- | --- | --- | --- | --- |
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |

with respect to the asymptotics of operations, the doubly linked list provides no advantage over the singly linked list.

tradeoffs

non-contiguous use of memory is an advantage in terms of finding more memory for nodes, but is also a disadvantage in terms of traversal.

strings

strings really are just special cases of arrays. but because their operations vary so wildly from the usual array operations:

  • concatenation
  • joining
  • comparison
  • splitting
  • searching for substrings

it makes sense for this topic to have its own little nook.
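
for concreteness, here is roughly what each of those operations looks like in Python:

```python
a, b = "data", "structures"

print(", ".join(["arrays", "strings", "stacks"]))  # joining: "arrays, strings, stacks"
print(a + " " + b)                                 # concatenation: "data structures"
print(a < b)                                       # comparison is lexicographic: True
print("data structures".split(" "))                # splitting: ['data', 'structures']
print("structures".find("ruct"))                   # substring search: index 2, or -1 if absent
print("ruct" in "structures")                      # membership form of substring search: True
```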

this page is not about algorithms, and so there is nothing really novel to add here at the moment, but a number of clever algorithms will relate back to this heading.

stacks & queues

stack

schematic

time complexity

|  | Access | Search | Insertion | Deletion |
| --- | --- | --- | --- | --- |
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |

as expected:

  • random access will take $n$ steps. presumably here we are using a linked list implementation though and so arbitrary accesses will always take (asymptotically) $n$ steps
  • similarly search as in an array or linked list requires $n$ steps
  • the advantage comes from inserting into the stack which always takes constant time
  • and deletion costs constant time. this is the use-case for this data structure anyhow
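
in Python a stack is commonly just a list used from one end; a small sketch:

```python
stack = []

stack.append("a")   # push: amortised O(1)
stack.append("b")
stack.append("c")

print(stack[-1])    # peek at the top: "c"
print(stack.pop())  # pop: O(1), returns "c" (last in, first out)
print(stack.pop())  # "b"
print(stack)        # ['a']
```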

queue

schematic

time complexity

|  | Access | Search | Insertion | Deletion |
| --- | --- | --- | --- | --- |
| Average | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |

this is the same as the stack:

  • underlying implementation details would be identical, so access behaviour wouldn't change
  • neither would search functionality
  • only the location of insertion
  • and location of deletion would change

double ended queue (deque)

this is an interesting data structure. at first I thought it was short for dequeue, but it is not. instead this structure is pronounced "deck", and is a list with two pointers, one at each end, so items can be added or removed at either end.
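
Python ships this structure as collections.deque, which also works as an efficient FIFO queue (popping from the front of a plain list is $\mathcal{O}(n)$, whereas popleft here is $\mathcal{O}(1)$):

```python
from collections import deque

q = deque()

# used as a plain FIFO queue
q.append("first")       # enqueue at the back: O(1)
q.append("second")
print(q.popleft())      # dequeue from the front: O(1) -> "first"

# used as a double-ended queue
q.appendleft("front")   # O(1) insertion at either end
q.append("back")
print(q.pop())          # O(1) removal at either end -> "back"
print(q)                # deque(['front', 'second'])
```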

hash tables / hashmaps

A Python dict maintains insertion order (guaranteed since 3.7); hash maps in general make no ordering guarantee.

schematic

time complexity

|  | Access | Search | Insertion | Deletion |
| --- | --- | --- | --- | --- |
| Average | n/a | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ | $\mathcal{O}(1)$ |
| Worst | n/a | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |

obviously the time complexity depends on how the table is implemented and on what we count as an atomic operation.

having said this, hash-tables are sort of a mystery to me at the moment. they have varying implementations and even then I need to study each to understand what the associated compute would look like for each method.

for now I have just copied down what was given at bigocheatsheet.com
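
to make the table a little less mysterious, here is a toy separate-chaining sketch (one of several implementation strategies; the bucket count and collision handling are deliberately simplistic). a key is hashed to a bucket, and collisions pile up in a short list inside that bucket, which is why the average case is $\mathcal{O}(1)$ but the worst case, with everything in one bucket, decays to $\mathcal{O}(n)$:

```python
class ChainedHashTable:
    """Toy hash table using separate chaining over a fixed number of buckets."""

    def __init__(self, n_buckets: int = 8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, key):
        # hash() maps the key to an integer; modulo picks one bucket
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # otherwise append to the chain

    def search(self, key):
        for k, v in self._bucket(key):   # scan only the one relevant chain
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket.pop(i)
                return True
        return False

table = ChainedHashTable()
table.insert("apples", 3)
table.insert("pears", 5)
print(table.search("apples"))  # 3
table.delete("apples")
print(table.search("apples"))  # None
```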

trees

schematic

Binary Search Tree

schematic

time complexity

|  | Access | Search | Insertion | Deletion |
| --- | --- | --- | --- | --- |
| Average | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
| Worst | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ | $\mathcal{O}(n)$ |

naturally, this table warrants explanation.

I can reconcile the average complexity of search taking $\log(n)$ steps; the worst case differs because nothing keeps a plain binary search tree balanced. insert keys in sorted order and the tree degenerates into a chain, at which point access, search, insertion and deletion all decay to $\mathcal{O}(n)$.

insertion does not trigger any rebalancing in a plain binary search tree; the self-balancing variants below (red-black and AVL trees) exist precisely to avoid this degeneration.
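
a bare-bones sketch (class and function names are mine) showing both the balanced case and the degenerate chain:

```python
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert without any rebalancing; returns the (possibly new) root."""
    if root is None:
        return BSTNode(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    """Walk left/right by comparison; O(height) steps."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root

def height(root):
    if root is None:
        return 0
    return 1 + max(height(root.left), height(root.right))

balanced = None
for key in (4, 2, 6, 1, 3, 5, 7):        # mixed order: height stays around log n
    balanced = insert(balanced, key)

degenerate = None
for key in (1, 2, 3, 4, 5, 6, 7):        # sorted order: every node goes right
    degenerate = insert(degenerate, key)

print(height(balanced), height(degenerate))   # 3 7
print(search(degenerate, 6) is not None)      # True, but found in O(n) steps
```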

Red Black Trees

schematic

another red black tree
small rb tree

time complexity

|  | Access | Search | Insertion | Deletion |
| --- | --- | --- | --- | --- |
| Average | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
| Worst | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |

I've just copied these down from bigocheatsheet.com blindly to close that tab and be done with it for now.

AVL Trees

An AVL tree is a binary search tree that is height-balanced: for each node $x$, the heights of the left and right subtrees of $x$ differ by at most 1.

To implement such a tree, maintain an extra attribute $h$ in each node such that $x.h$ is the height of node $x$.
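
a sketch of just that bookkeeping: each node carries its height, and the balance factor tells us when a rotation would be required (the rotations themselves are omitted; helper names are mine):

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.h = 1           # height of the subtree rooted here (x.h in the text)

def height(node):
    return node.h if node is not None else 0

def update_height(node):
    node.h = 1 + max(height(node.left), height(node.right))

def balance_factor(node):
    """height(left) - height(right); an AVL tree keeps this in {-1, 0, 1}."""
    return height(node.left) - height(node.right)

root = AVLNode(10)
root.left = AVLNode(5)
root.left.left = AVLNode(2)
update_height(root.left)
update_height(root)

print(balance_factor(root))  # 2 -> violates the AVL property, a rotation is needed
```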

|  | Access | Search | Insertion | Deletion |
| --- | --- | --- | --- | --- |
| Average | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |
| Worst | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ | $\mathcal{O}(\log(n))$ |

same thing as above, here. I've just copied the time-complexities down without understanding them.

I also worry about the additional 5 'data-structures':

  1. skip lists
  2. cartesian tree
  3. b-tree (this one makes sense)
  4. splay tree
  5. kd tree

and then the short few that I have found of my own accord:

  1. tries
  2. heaps
  3. priority queue (a heap seems to be one way of implementing this)

heaps

schematic

graphs

schematic

representations

adjacency matrix, adjacency lists (linked lists)
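
the same small graph in both representations, as a sketch (vertices are simply labelled 0 to 3):

```python
# an undirected graph with edges (0,1), (0,2), (1,2), (2,3)
n = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

# adjacency matrix: n x n grid where matrix[u][v] == 1 iff the edge (u, v) exists
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = 1
    matrix[v][u] = 1

# adjacency list: for each vertex, the list of its neighbours
adjacency = {u: [] for u in range(n)}
for u, v in edges:
    adjacency[u].append(v)
    adjacency[v].append(u)

print(matrix[2])     # [1, 1, 0, 1] -> vertex 2 neighbours 0, 1 and 3
print(adjacency[2])  # [0, 1, 3]
```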