SQL
0. When using a GROUP BY clause, every column in the SELECT list must either be inside an aggregate function or appear in the GROUP BY clause.
1. HAVING filters rows after grouping; WHERE filters rows before grouping.
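The WHERE vs. HAVING distinction can be checked with Python's built-in sqlite3 module (the table and data below are made up for illustration):

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 10), ("alice", 30), ("bob", 5), ("bob", 2)])

# WHERE filters individual rows BEFORE grouping;
# HAVING filters the groups AFTER aggregation.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 1          -- row-level filter (pre-grouping)
    GROUP BY customer
    HAVING SUM(amount) > 20   -- group-level filter (post-grouping)
""").fetchall()
print(rows)  # only alice's group survives: [('alice', 40)]
```

Note that `SUM(amount)` appears in the SELECT list as an aggregate, and `customer` appears in GROUP BY, satisfying rule 0 above.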
2. Why use a buffer pool instead of relying on the OS page cache (e.g., via mmap)?
a. Transaction Safety
The OS could flush dirty pages at any time.
b. I/O Stalls
The DBMS doesn't know which pages are in memory, so the OS may stall a thread on a page fault.
c. Error Handling
Pages are difficult to validate. Any access can cause a SIGBUS that the DBMS must handle.
d. Performance Issues
OS data structure contention; TLB shootdowns.
In a nutshell, the DBMS can do these better than the OS:
→ Flushing dirty pages to disk in the correct order.
→ Specialized prefetching.
→ Buffer replacement policy.
→ Thread/process scheduling.
HEAP FILE
A heap file is an unordered collection of pages with tuples that are stored in random order.
The DBMS needs meta-data to keep track of which pages exist across multiple files and which ones have free space.
To do that, it maintains a page directory that tracks the locations of data pages and records meta-data about available space.
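A minimal sketch of a heap file's page directory (all names and the layout are illustrative, not from any real DBMS):

```python
class PageDirectory:
    """Maps page ids to their file location and tracks free space per page."""
    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.entries = {}  # page_id -> {"offset": byte offset in file, "free": free bytes}

    def add_page(self, page_id, offset):
        self.entries[page_id] = {"offset": offset, "free": self.page_size}

    def record_insert(self, page_id, tuple_size):
        self.entries[page_id]["free"] -= tuple_size

    def find_page_with_space(self, tuple_size):
        # Heap file: pages are unordered, so any page with enough room will do.
        for page_id, meta in self.entries.items():
            if meta["free"] >= tuple_size:
                return page_id
        return None  # caller would allocate a new page

pd = PageDirectory()
pd.add_page(0, offset=0)
pd.record_insert(0, 4000)      # page 0 now has only 96 bytes free
pd.add_page(1, offset=4096)
print(pd.find_page_with_space(500))  # -> 1
```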
Three Workload Types
On-Line Transaction Processing (OLTP)(Row Store)
→ Fast operations that only read/update a small amount of data each time.
On-Line Analytical Processing (OLAP)(Column Store)
→ Complex queries that read a lot of data to compute aggregates.
Hybrid Transaction + Analytical Processing
→ OLTP + OLAP together on the same database.
NSM and DSM
refer to row storage (N-ary Storage Model) and column storage (Decomposition Storage Model)
NSM Trade-offs
Advantages
→ Fast inserts, updates, and deletes.
→ Good for queries that need the entire tuple.
Disadvantages
→ Not good for scanning large portions of the table and/or a subset of the attributes.
DSM Trade-offs
Advantages
→ Reduces the amount of wasted I/O because the DBMS only reads the data that it needs.
→ Better query processing and data compression (more on this later).
Disadvantages
→ Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching.
TUPLE IDENTIFICATION
1. Fixed-Length Offsets
Each value is the same length for an attribute.
2. Embedded Tuple Ids
Each value is stored with its tuple id in a column.
COMPRESSION METHODS
GOAL
1.Must produce fixed-length values.
2.Postpone decompression for as long as possible during query execution.
3.Must be a lossless scheme.
FOUR GRANULARITY LEVELS
1.Block-level
Compress a block of tuples for the same table.
2.Tuple-level
Compress the contents of the entire tuple (NSM-only).
3.Attribute-level
Compress a single attribute within one tuple (overflow).
4.Column-level
Compress multiple values for one or more attributes stored for multiple tuples (DSM-only).
COLUMNAR COMPRESSION
| Type | Characteristics | Drawbacks |
|---|---|---|
| Run-Length Encoding | Compress runs of the same value into triplets: the value itself, its start position, and the run length. | |
| Bit-Packing Encoding | When values for an attribute are always smaller than the declared type's largest size, store them as a smaller data type (e.g., long long → short). | |
| Bitmap Encoding | Store a separate bitmap per unique value of an attribute: bit i is 1 if tuple i has that value, otherwise 0. | Only practical when the value cardinality is low. |
| Delta Encoding | Record the difference between values that follow each other in the same column. → Store the base value inline. → Combine with RLE for even better compression. | |
| Incremental Encoding | A type of delta encoding that avoids duplicating common prefixes/suffixes between consecutive tuples. Works best with sorted data. | |
| Dictionary Encoding | Build a data structure that maps variable-length values to smaller integer identifiers, then replace values with their corresponding identifiers. → Needs to support fast encoding and decoding. → Needs to also support range queries. | |
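Two of the schemes above (run-length and dictionary encoding) sketched in Python — a toy illustration, not a real storage-engine format:

```python
def rle_encode(column):
    """Run-length encoding: (value, start_offset, run_length) triplets."""
    runs, i = [], 0
    while i < len(column):
        j = i
        while j < len(column) and column[j] == column[i]:
            j += 1
        runs.append((column[i], i, j - i))
        i = j
    return runs

def dict_encode(column):
    """Dictionary encoding: map each distinct value to a small integer id.
    Assigning ids in sorted order preserves range-query support."""
    mapping = {v: i for i, v in enumerate(sorted(set(column)))}
    return mapping, [mapping[v] for v in column]

col = ["M", "M", "M", "F", "F", "M"]
print(rle_encode(col))   # [('M', 0, 3), ('F', 3, 2), ('M', 5, 1)]
mapping, codes = dict_encode(col)
print(mapping, codes)    # {'F': 0, 'M': 1} [1, 1, 1, 0, 0, 1]
```

Because the dictionary ids are assigned in sorted order, a range predicate on the original values maps to a range predicate on the ids.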
BUFFER POOL MANAGER
How the DBMS manages its memory and moves data back and forth from disk.
Two concerns: Spatial Control and Temporal Control.
Structure in Memory
The page table keeps track of pages that are currently in memory.
It also maintains additional meta-data per page:
→ Dirty Flag
→ Pin/Reference Counter
PAGE TABLE VS. PAGE DIRECTORY
The page directory is the mapping from page ids to page locations in database files.
→ All changes must be recorded on disk to allow the DBMS to find them on restart.
The page table is the mapping from page ids to a copy of the page in buffer pool frames.
→ This is an in-memory data structure that does not need to be stored on disk.
ALLOCATION POLICIES
Global Policies:
→ Make decisions for all active queries.
Local Policies:
→ Allocate frames to specific queries without considering the behavior of concurrent queries.
→ Still need to support sharing pages.
MULTIPLE BUFFER POOLS
The DBMS can maintain multiple buffer pools (e.g., per database or per page type) to reduce latch contention and improve locality.
PRE-FETCHING
The DBMS can prefetch pages based on the query plan: sequential scans and index scans.
SCAN-SHARING
Queries can reuse data retrieved from
storage or operator computations.
If a query wants to scan a table and another query is already doing this, then the DBMS will attach the second query’s cursor to the existing cursor.
BUFFER POOL BYPASS
The sequential scan operator will not store
fetched pages in the buffer pool to avoid
overhead.
OS PAGE CACHE
Most disk operations go through the OS API. Unless the DBMS tells it not to, the OS maintains its own filesystem cache (aka page cache, buffer cache).
BUFFER REPLACEMENT POLICIES
When the DBMS needs to free up a frame to make room for a new page, it must decide which page to evict from the buffer pool.
1.LEAST-RECENTLY USED
2.CLOCK
3.MOST-RECENTLY USED
4.LRU-K
Track the history of last K references to each page as timestamps and compute the interval between subsequent accesses.
5.LOCALIZATION
The DBMS chooses which pages to evict on a per-query basis. This minimizes the pollution of the buffer pool from each query.
→ Keep track of the pages that a query has accessed.
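The LRU-K policy above can be sketched as follows (class and method names are illustrative; real replacers also handle pinned pages):

```python
from collections import defaultdict

class LRUKReplacer:
    """Sketch of LRU-K: evict the page whose K-th most recent access is oldest.
    Pages with fewer than K recorded accesses have infinite backward distance
    and are evicted first (plain LRU among themselves)."""
    def __init__(self, k=2):
        self.k = k
        self.history = defaultdict(list)  # page_id -> timestamps, newest last
        self.clock = 0                    # logical clock instead of wall time

    def record_access(self, page_id):
        self.clock += 1
        h = self.history[page_id]
        h.append(self.clock)
        if len(h) > self.k:
            h.pop(0)  # keep only the last K accesses

    def victim(self):
        def key(pid):
            h = self.history[pid]
            if len(h) < self.k:
                return (0, h[0])  # infinite backward distance bucket
            return (1, h[0])      # h[0] is the K-th most recent access
        return min(self.history, key=key)

r = LRUKReplacer(k=2)
for pid in [1, 2, 1, 3, 1, 2]:
    r.record_access(pid)
print(r.victim())  # 3 -- touched only once, so it is evicted first
```

Tracking the K-th access (instead of just the last one) keeps a one-time sequential scan from flushing frequently reused pages.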
HASH TABLES
HASH FUNCTIONS
We want something that is fast and has a low collision rate.
State of the art: xxHash (from Facebook).
STATIC HASH TABLES
Allocate a giant array that has one slot for every element you need to store.
To find an entry, hash the key and mod by the number of slots to find the offset in the array.
1. Linear Probe Hashing
Single giant table of slots.
Resolve collisions by linearly searching for the next free slot in the table.
Problems arise on deletes; two approaches:
| Ways | How |
|-|-|
| Re-hash | Re-hash and update the slots for all keys. → Nobody actually does this. |
| Tombstone | Set a marker indicating that the entry in the slot is logically deleted. → You can reuse the slot for inserting new keys. → May require periodic garbage collection. |
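A sketch of linear probing with tombstone deletion (fixed-size, no resizing, names are illustrative):

```python
TOMBSTONE = object()  # sentinel marking a logically deleted slot

class LinearProbeTable:
    """Sketch of linear-probe hashing with tombstone deletion."""
    def __init__(self, size=8):
        self.slots = [None] * size

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            yield i
            i = (i + 1) % len(self.slots)

    def insert(self, key, value):
        for i in self._probe(key):
            entry = self.slots[i]
            # Tombstoned slots may be reused for new keys.
            if entry is None or entry is TOMBSTONE or entry[0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full")

    def get(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:  # truly empty: the key cannot be further along
                return None
            if entry is not TOMBSTONE and entry[0] == key:
                return entry[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:
                return
            if entry is not TOMBSTONE and entry[0] == key:
                self.slots[i] = TOMBSTONE  # keep the probe chain intact
                return

t = LinearProbeTable()
t.insert("a", 1); t.insert("b", 2)
t.delete("a")
print(t.get("b"))  # 2 -- still reachable even if it probed past "a"'s slot
```

The tombstone is what keeps lookups correct: if `delete` set the slot back to `None`, a later `get` could stop early and miss keys that had probed past it.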
NON-UNIQUE KEYS
→ Choice #1: Separate linked list of values per key.
→ Choice #2: Store duplicate keys redundantly in the table.
2. ROBIN HOOD HASHING
→ Each key tracks the number of positions it is from its optimal position in the table.
→ On insert, a key takes the slot of another key if the first key is farther from its optimal position than the second key.
3. CUCKOO HASHING
Use multiple hash tables with different hash function seeds.
→ On insert, check every table and pick any one that has a free slot.
→ If no table has a free slot, evict an element from one of them and then re-hash it to find a new location.
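A minimal two-table cuckoo sketch (seeds are simulated by salting the key per table; a real implementation would cap the displacement chain and rebuild with new seeds):

```python
class CuckooTable:
    """Sketch of cuckoo hashing with two tables and different hash seeds."""
    def __init__(self, size=8):
        self.tables = [[None] * size, [None] * size]

    def _slot(self, which, key):
        # Different "seeds" simulated by salting the key per table.
        return hash((which, key)) % len(self.tables[which])

    def insert(self, key, value, max_kicks=32):
        entry, which = (key, value), 0
        for _ in range(max_kicks):
            i = self._slot(which, entry[0])
            if self.tables[which][i] is None:
                self.tables[which][i] = entry
                return True
            # Evict the occupant and re-insert it into the other table.
            self.tables[which][i], entry = entry, self.tables[which][i]
            which = 1 - which
        return False  # cycle: a real implementation rebuilds with new seeds

    def get(self, key):
        # Lookups are O(1): each key can only live in one slot per table.
        for which in (0, 1):
            e = self.tables[which][self._slot(which, key)]
            if e is not None and e[0] == key:
                return e[1]
        return None

c = CuckooTable()
for k, v in [("x", 1), ("y", 2), ("z", 3)]:
    c.insert(k, v)
print(c.get("y"))  # 2
```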
DYNAMIC HASH TABLES
Dynamic hash tables resize themselves on demand.
CHAINED HASHING
Maintain a linked list of buckets for each slot in the hash table.
Resolve collisions by placing all elements with the same hash key into the same bucket.
→ To determine whether an element is present, hash to its bucket and scan for it.
→ Insertions and deletions are generalizations of lookups.
→ New entries are appended at the end of the bucket's linked list.
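A minimal chained-hashing sketch (one Python list standing in for each bucket's linked list):

```python
from collections import defaultdict

class ChainedHashTable:
    """Sketch of chained hashing: one bucket (list) per slot."""
    def __init__(self, num_slots=8):
        self.buckets = defaultdict(list)
        self.num_slots = num_slots

    def insert(self, key, value):
        # Append at the end of the bucket's linked list.
        self.buckets[hash(key) % self.num_slots].append((key, value))

    def get(self, key):
        # Hash to the bucket, then scan it.
        for k, v in self.buckets[hash(key) % self.num_slots]:
            if k == key:
                return v
        return None

ct = ChainedHashTable()
ct.insert("a", 1); ct.insert("b", 2)
print(ct.get("a"))  # 1
```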
EXTENDIBLE HASH TABLE
→ Look at some bits of the hash: a global depth for the directory and a local depth per bucket.
→ Split a bucket only when it overflows; if the bucket still has extra space, do not split!
LINEAR HASHING
The hash table maintains a pointer that tracks the next bucket to split.
→ When any bucket overflows, split the bucket at the pointer location (not necessarily the one that overflowed).
→ Use multiple hashes to find the right bucket for a given key: buckets that have already been split (above the split pointer) use the new hash function; buckets not yet split (below it) use the old one.
→ Each round doubles the number of buckets (the round-i hash uses mod 2^i · n), so at any point in time only two hash functions are live.
Can use different overflow criteria:
→ Space Utilization
→ Average Length of Overflow Chains
B TREE
For an M-way B-tree, the number of keys per node is in [⌈M/2⌉−1, M−1], so the number of children is in [⌈M/2⌉, M].
PROPERTIES
1.Key order property
2.Half-full property
Every node other than the root is at least half-full
⌈M/2⌉−1 ≤ #keys ≤ M−1
3.Balance property
It is perfectly balanced (i.e., every leaf node is at the same depth in the tree).
4.Separator keys
A node with t keys has t+1 child pointers; the keys act as separators between subtrees.
Time complexity
query: O(log n)
insert: O(log n)
delete: O(log n)
Two approaches for leaf node data
1. Store Record IDs
2. Store Tuple Data
Insert
1. Find the corresponding leaf node.
2. If the leaf is not full, insert into it.
3. Otherwise, split the node and push one extra middle key up into its parent.
4. Repeat this operation up to the root node if splits propagate.
Deletion
Start at the root, find the leaf L where the entry belongs, and remove the entry.
If L is at least half-full, done!
If L has only ⌈M/2⌉−1 entries:
→ Try to re-distribute, borrowing from a sibling (adjacent node with the same parent as L).
→ If re-distribution fails, merge L and the sibling.
If a merge occurred, must delete the entry (pointing to L or the sibling) from the parent of L.
DUPLICATE KEY METHODS
Approach #1: Append Record ID
→ Add the tuple’s unique Record ID as part of the key to ensure that all keys are unique.
→ The DBMS can still use partial keys to find tuples.
Approach #2: Overflow Leaf Nodes
→ Allow leaf nodes to spill into overflow nodes that contain the duplicate keys.
→ This is more complex to maintain and modify.
B+ TREE DESIGN CHOICES
NODE SIZE
The slower the storage device, the larger the optimal node size for a B+Tree.
→ HDD: ~1MB
→ SSD: ~10KB
→ In-Memory: ~512B
Optimal sizes can vary depending on the workload.
→ Leaf Node Scans vs. Root-to-Leaf Traversals
MERGE THRESHOLD
Some DBMSs do not always merge nodes when they are half full.
Delaying a merge operation may reduce the amount of reorganization.
It may also be better to just let smaller nodes exist and then periodically rebuild entire tree.
VARIABLE-LENGTH KEYS
→ Approach #1: Pointers — store keys as pointers to the tuple's attribute.
→ Approach #2: Variable-Length Nodes.
→ Approach #3: Padding — pad keys to the maximum length.
→ Approach #4: Key Map / Indirection — embed an array of pointers to the key+value pairs within the node.
INTRA-NODE SEARCH
→ Linear: scan the node's keys from beginning to end.
→ Binary: jump to the middle key and pivot left/right.
→ Interpolation: approximate the desired key's location based on the distribution of the stored keys.
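The three search strategies over one sorted node, sketched in Python (the node's key array is made up):

```python
from bisect import bisect_left

# Sorted keys inside a hypothetical B+Tree node.
keys = [10, 20, 30, 40, 50, 60, 70, 80]

def linear_search(keys, target):
    # Scan from the beginning until a key >= target.
    for i, k in enumerate(keys):
        if k >= target:
            return i
    return len(keys)

def binary_search(keys, target):
    # Jump to the middle and pivot; bisect_left does exactly this.
    return bisect_left(keys, target)

def interpolation_search(keys, target):
    # Estimate the position assuming roughly uniform key distribution,
    # then correct with a short local scan.
    lo, hi = keys[0], keys[-1]
    if target <= lo:
        return 0
    pos = min(len(keys) - 1, (target - lo) * (len(keys) - 1) // (hi - lo))
    while pos > 0 and keys[pos - 1] >= target:
        pos -= 1
    while pos < len(keys) and keys[pos] < target:
        pos += 1
    return pos

print(linear_search(keys, 40), binary_search(keys, 40),
      interpolation_search(keys, 40))  # 3 3 3
```

Interpolation only wins when keys really are near-uniform; otherwise its guess degrades to the same scan as the fallback loop.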
LOCKS AND LATCHES
Locks protect the database's logical contents from other transactions; latches protect the DBMS's internal data structures from other threads.
LATCH MODES
→ READ: multiple threads can read at the same time.
→ WRITE: only one thread may hold it; no concurrent READ or WRITE latches.
LATCH IMPLEMENTATION
1. OS MUTEX
Simple to use and Non-scalable
2. READER WRITER LATCHES
→ Allows for concurrent readers. Must manage read/write queues to avoid starvation.
→ Can be implemented on top of spinlocks.
→ Example: std::shared_mutex
HASH TABLE LATCHING
LATCH CRABBING/COUPLING
A protocol that lets multiple threads access/modify a B+Tree at the same time: acquire the child's latch, then release the parent's latch if the child is "safe" (will not split or merge).
BOTTLENECK
Taking a write latch on the root every time becomes a bottleneck with higher concurrency.
BETTER LATCHING ALGORITHM
TOP-K HEAP SORT
When duplicate keys come in, double the heap's size and insert them.
EXTERNAL MERGE SORT
Two-way and general (k-way) variants.
→ k-way means how many sorted runs are merged per pass.
→ With B buffer pages, merge B−1 runs at a time, since one extra buffer page is reserved for writing the output.
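The merge phase can be sketched with heapq.merge, which keeps one "cursor" per sorted run, analogous to one input buffer page per run plus one output page (the runs below are made up):

```python
import heapq

# Sorted runs, as produced by the sort phase (each run fit in memory).
runs = [
    [1, 4, 9],
    [2, 3, 8],
    [5, 6, 7],
]

# k-way merge: streams the runs without materializing them all at once.
merged = list(heapq.merge(*runs))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With B buffer pages a real DBMS would merge at most B−1 runs per pass, repeating passes until one sorted run remains.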
EXTERNAL HASHING AGGREGATION
JOIN ALGORITHMS
Three types: Nested Loop Join, Sort-Merge Join, and Hash Join.
Query Execution
Three approaches
Iterator Model
Each operator implements a Next() function; the DBMS processes one tuple at a time (a.k.a. the Volcano or Pipeline model).
Materialization Model
Each operator processes its entire input at once and then emits its entire output at once.
Better for OLTP workloads because queries only access a small number of tuples at a time.
Not good for OLAP queries with large intermediate results.
Vectorization Model
Like the iterator model, but each Next() emits a batch of tuples instead of a single tuple.
Ideal for OLAP queries because it greatly reduces the number of invocations per operator.
Allows operators to more easily use vectorized (SIMD) instructions to process batches of tuples.
PLAN PROCESSING DIRECTION
→ Top-to-Bottom (pull): start with the root and pull data up from children.
→ Bottom-to-Top (push): start with the leaf nodes and push data up to parents.
ACCESS METHODS
The way the DBMS accesses the data stored in a table. Three ways:
1. Sequential Scan
ZONE MAPS
Pre-computed aggregates (e.g., min/max) for the attribute values in a page. The DBMS checks the zone map first to decide whether it needs to access the page.
2. Index Scan
3. Multi-Index Scan
PARALLEL VS DISTRIBUTED
Parallel DBMSs
PARALLEL
Inter-Query PARALLELISM
Execute multiple disparate queries simultaneously.
→ Increases throughput & reduces latency.
Intra-Query PARALLELISM
Execute the operations of a single query in parallel.
→ Decreases latency for long-running queries,
especially for OLAP queries.
Intra-Operator (Horizontal)
Decompose operators into independent fragments that perform the same function on different subsets of data.
EXCHANGE OPERATOR
Gather: Combine the results from multiple workers into a single output stream.
Distribute: Split a single input stream into multiple output streams.
Repartition: Shuffle multiple input streams across multiple output streams.
Inter-Operator (Vertical)
Operators are overlapped in a pipeline: each operator (e.g., Join, Scan) runs continuously, streaming its results to the next.
Bushy
A combination of the previous two: workers execute multiple operators from different segments of the plan at the same time.
I/O PARALLELISM
→ RAID 0 (striping)
QUERY OPTIMIZATION
LOGICAL PLAN OPTIMIZER
Transform a logical plan into an equivalent logical plan using pattern matching rules.
The goal is to increase the likelihood of enumerating the optimal plan in the search.
Cannot compare plans because there is no cost model but can “direct” a transformation to a preferred side.
Split Conjunctive Predicates
Instead of evaluating one huge predicate, split it into its conjunctive terms.
Predicate Pushdown
Move predicates as close to the data as possible, so as much data as possible is filtered out early.
Replace Cartesian Products with Joins
Rewrite a Cartesian product followed by a selection as an inner join.
Projection Pushdown
Push projections down so that only the needed attributes flow up through the plan.
DECOMPOSING QUERIES
For nested sub-queries, decompose them and run each only once. Some systems (e.g., MySQL) run the sub-query first, then hand the result to the optimizer.
EXPRESSION REWRITING
Traverse the query plan and check whether expressions can be simplified (e.g., constant folding, removing impossible predicates).
COST ESTIMATION
It is too expensive to run every possible plan to determine this information, so the DBMS estimates the cost instead.
STATISTICS
The DBMS stores internal statistics about tables, attributes, and indexes in its internal catalog.
SELECTION CARDINALITY
Equality Predicate: A = constant
Ways to store the statistics:
HISTOGRAMS
Problem: storing an exact count for every single key does not scale as the number of distinct values grows.
Solution: group values into buckets.
→ Equi-width buckets (same key range per bucket) vs. equi-depth buckets (same count per bucket, varying width).
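The two bucketing strategies sketched in Python (toy data; real optimizers store only bucket boundaries and counts):

```python
def equi_width(values, num_buckets):
    """Equal-width histogram: each bucket spans the same key range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1  # avoid zero width if all values equal
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts

def equi_depth(values, num_buckets):
    """Equal-depth histogram: boundaries vary so each bucket holds
    (roughly) the same number of values."""
    s = sorted(values)
    size = len(s) // num_buckets
    return [s[i:i + size] for i in range(0, size * num_buckets, size)]

vals = [1, 1, 2, 3, 4, 8, 9, 9, 9, 10]
print(equi_width(vals, 3))  # [4, 1, 5] -- skew leaves buckets unbalanced
print(equi_depth(vals, 2))  # [[1, 1, 2, 3, 4], [8, 9, 9, 9, 10]]
```

The skewed counts in the equi-width output show why equi-depth buckets often give better selectivity estimates on non-uniform data.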
SKETCHES
Bloom filter
SAMPLING
Collect samples from tables to estimate selectivities.
SINGLE RELATION QUERY PLANNING
choose the optimal indexes
MULTI-RELATION QUERY PLANNING
SYSTEM R OPTIMIZER
Break the query up into blocks and generate the logical operators for each block.
CONCURRENCY CONTROL
MOTIVATION
DEFINITIONS
A txn may carry out many operations on the data retrieved from the database
ACID
ATOMICITY
1: logging
2: Shadow Paging
CONSISTENCY
1: database consistency
2: transaction consistency
ISOLATION
Won’t see any other transaction when executing the current one
Serial Schedule
A schedule that does not interleave the actions of different transactions
Equivalent Schedules
Two schedules are equivalent if they have the same effect on the database.
CONFLICTING OPERATIONS
Read-Write Conflicts
Write-Read Conflicts
Write-Write Conflicts
Conflict Serializability and View Serializability
Conflict Serializability
DEPENDENCY GRAPH
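A schedule is conflict serializable iff its dependency (precedence) graph has no cycles. A small sketch (the schedule encoding is made up for illustration):

```python
def dependency_edges(schedule):
    """schedule: list of (txn, op, obj) with op in {'R', 'W'}, in execution order.
    Add edge Ti -> Tj when an op of Ti conflicts with a LATER op of Tj
    on the same object (at least one of the two ops is a write)."""
    edges = set()
    for idx, (ti, op1, obj1) in enumerate(schedule):
        for tj, op2, obj2 in schedule[idx + 1:]:
            if ti != tj and obj1 == obj2 and "W" in (op1, op2):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    # DFS-based cycle check over the txn graph.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    visiting, done = set(), set()
    def dfs(n):
        visiting.add(n)
        for m in graph[n]:
            if m in visiting or (m not in done and dfs(m)):
                return True
        visiting.discard(n)
        done.add(n)
        return False
    return any(dfs(n) for n in graph if n not in done)

# T1 reads A before T2 writes A (T1->T2), but T2 writes B before
# T1 writes B (T2->T1): a cycle, so not conflict serializable.
s = [("T1", "R", "A"), ("T2", "W", "A"), ("T2", "W", "B"), ("T1", "W", "B")]
print(has_cycle(dependency_edges(s)))  # True
```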
View Serializability
A schedule is view serializable if it is view-equivalent to some serial schedule (same reads-from relationships and same final writes). This admits some schedules with conflicts that conflict serializability rejects.
TWO-PHASE LOCKING (2PL)
→ Phase #1 (Growing): acquire locks; never release.
→ Phase #2 (Shrinking): release locks; never acquire new ones.
Cascading Aborts
If a txn aborts, every txn that read its uncommitted writes must also abort. This is a performance issue, not a correctness one.
STRONG STRICT TWO-PHASE LOCKING
The txn holds all its locks until it commits or aborts, which avoids cascading aborts.
types
Deadlock Detection and Prevention
→ Detection: build a waits-for graph in the background, detect cycles, and pick a victim txn to abort.
→ Prevention: instead of using a background thread to detect deadlocks, decide whether granting a lock could cause a deadlock at the moment the lock is requested.
Granularity
Intention Locks
Signal the intent to acquire explicit locks at a finer granularity lower in the hierarchy (IS, IX, SIX).
TIMESTAMP ORDERING (T/O)
An optimistic approach: use timestamps to determine the serializability order of txns.
Basic T/O
OPTIMISTIC CONCURRENCY CONTROL(OCC)
OCC PHASES
VALIDATION PHASE
STEP I
STEP II
STEP III
PHANTOM PROBLEM SOLUTIONS
ISOLATION LEVELS
MVCC
MVCC VERSION STORAGE
garbage collection
LOGGING
STORAGE TYPE
FAILURE
transaction failure: logical errors or internal state errors in a txn.
system failure: software or hardware crashes; non-volatile storage survives.
storage media failure: non-repairable hardware failure; restore the database from an archive.
redo and undo log
STEAL POLICY
FORCE POLICY
SHADOW PAGING
WAL
LOGGING SCHEMA
CHECKPOINT
RECOVERY
ARIES:Algorithm for Recovery and Isolation Exploiting Semantics