SQL
0. When using a GROUP BY clause, every column in the SELECT list must either be inside an aggregate function or appear in the GROUP BY clause.
1. HAVING filters rows after grouping; WHERE filters rows before grouping.
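The WHERE vs. HAVING distinction can be checked with Python's built-in sqlite3 module (the table and data below are made up for illustration):

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 10), ("alice", 30), ("bob", 5), ("bob", 2)])

# WHERE filters individual rows BEFORE grouping;
# HAVING filters the groups AFTER aggregation.
rows = conn.execute("""
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount > 1          -- row-level filter (pre-grouping)
    GROUP BY customer
    HAVING SUM(amount) > 20   -- group-level filter (post-grouping)
""").fetchall()
print(rows)  # only alice's group survives: [('alice', 40)]
```

Note that `SUM(amount)` appears in the SELECT list as an aggregate, and `customer` appears in GROUP BY, satisfying rule 0 above.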
2. Why use a buffer pool instead of relying on the OS page cache (e.g., via mmap)?
a. Transaction Safety
The OS could flush dirty pages at any time.
b. I/O Stalls
The DBMS doesn't know which pages are in memory, so the OS may stall a thread on a page fault.
c. Error Handling
Pages are difficult to validate. Any access can cause a SIGBUS that the DBMS must handle.
d. Performance Issues
OS data structure contention; TLB shootdowns.
In a nutshell, the DBMS can do these better than the OS:
→ Flushing dirty pages to disk in the correct order.
→ Specialized prefetching.
→ Buffer replacement policy.
→ Thread/process scheduling.
HEAP FILE
A heap file is an unordered collection of pages with tuples that are stored in random order.
The DBMS needs meta-data to keep track of which pages exist across multiple files and which ones have free space.
To do that, it maintains a page directory that tracks the locations of data pages and records meta-data about available space.
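A minimal sketch of a heap file's page directory (all names and the layout are illustrative, not from any real DBMS):

```python
class PageDirectory:
    """Maps page ids to their file location and tracks free space per page."""
    def __init__(self, page_size=4096):
        self.page_size = page_size
        self.entries = {}  # page_id -> {"offset": byte offset in file, "free": free bytes}

    def add_page(self, page_id, offset):
        self.entries[page_id] = {"offset": offset, "free": self.page_size}

    def record_insert(self, page_id, tuple_size):
        self.entries[page_id]["free"] -= tuple_size

    def find_page_with_space(self, tuple_size):
        # Heap file: pages are unordered, so any page with enough room will do.
        for page_id, meta in self.entries.items():
            if meta["free"] >= tuple_size:
                return page_id
        return None  # caller would allocate a new page

pd = PageDirectory()
pd.add_page(0, offset=0)
pd.record_insert(0, 4000)      # page 0 now has only 96 bytes free
pd.add_page(1, offset=4096)
print(pd.find_page_with_space(500))  # -> 1
```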
Three Workload Types
On-Line Transaction Processing (OLTP)(Row Store)
→ Fast operations that only read/update a small amount of data each time.
On-Line Analytical Processing (OLAP)(Column Store)
→ Complex queries that read a lot of data to compute aggregates.
Hybrid Transaction + Analytical Processing
→ OLTP + OLAP together on the same database.
NSM and DSM
refer to row storage (N-ary Storage Model) and column storage (Decomposition Storage Model)
NSM Trade-offs
Advantages
→ Fast inserts, updates, and deletes.
→ Good for queries that need the entire tuple.
Disadvantages
→ Not good for scanning large portions of the table and/or a subset of the attributes.
DSM Trade-offs
Advantages
→ Reduces the amount of wasted I/O because the DBMS only reads the data that it needs.
→ Better query processing and data compression (more on this later).
Disadvantages
→ Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching.
TUPLE IDENTIFICATION
1. Fixed-Length Offsets
Each value is the same length for an attribute.
2. Embedded Tuple Ids
Each value is stored with its tuple id in a column.
COMPRESSION METHODS
GOAL
1.Must produce fixed-length values.
2.Postpone decompression for as long as possible during query execution.
3.Must be a lossless scheme.
FOUR GRANULARITY LEVELS
1.Block-level
Compress a block of tuples for the same table.
2.Tuple-level
Compress the contents of the entire tuple (NSM-only).
3.Attribute-level
Compress a single attribute within one tuple (overflow).
4.Column-level
Compress multiple values for one or more attributes stored for multiple tuples (DSM-only).
COLUMNAR COMPRESSION
| Type | Characteristics | Drawbacks |
|---|---|---|
| Run-Length Encoding | Compress runs of the same value into triplets: the value itself, its start position, and the run length. | |
| Bit-Packing Encoding | When values for an attribute are always smaller than the declared type's largest size, store them as a smaller data type (e.g., long long → short). | |
| Bitmap Encoding | Store a separate bitmap per unique value of an attribute: bit i is 1 if tuple i has that value, otherwise 0. | Only practical when the value cardinality is low. |
| Delta Encoding | Record the difference between values that follow each other in the same column. → Store the base value inline. → Combine with RLE for even better compression. | |
| Incremental Encoding | A type of delta encoding that avoids duplicating common prefixes/suffixes between consecutive tuples. Works best with sorted data. | |
| Dictionary Encoding | Build a data structure that maps variable-length values to smaller integer identifiers, then replace values with their corresponding identifiers. → Needs to support fast encoding and decoding. → Needs to also support range queries. | |
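Two of the schemes above (run-length and dictionary encoding) sketched in Python — a toy illustration, not a real storage-engine format:

```python
def rle_encode(column):
    """Run-length encoding: (value, start_offset, run_length) triplets."""
    runs, i = [], 0
    while i < len(column):
        j = i
        while j < len(column) and column[j] == column[i]:
            j += 1
        runs.append((column[i], i, j - i))
        i = j
    return runs

def dict_encode(column):
    """Dictionary encoding: map each distinct value to a small integer id.
    Assigning ids in sorted order preserves range-query support."""
    mapping = {v: i for i, v in enumerate(sorted(set(column)))}
    return mapping, [mapping[v] for v in column]

col = ["M", "M", "M", "F", "F", "M"]
print(rle_encode(col))   # [('M', 0, 3), ('F', 3, 2), ('M', 5, 1)]
mapping, codes = dict_encode(col)
print(mapping, codes)    # {'F': 0, 'M': 1} [1, 1, 1, 0, 0, 1]
```

Because the dictionary ids are assigned in sorted order, a range predicate on the original values maps to a range predicate on the ids.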
BUFFER POOL MANAGER
How the DBMS manages its memory and moves data back and forth from disk.
Two concerns: Spatial Control and Temporal Control.
Structure in Memory
The page table keeps track of pages that are currently in memory.
It also maintains additional meta-data per page:
→ Dirty Flag
→ Pin/Reference Counter
PAGE TABLE VS. PAGE DIRECTORY
The page directory is the mapping from page ids to page locations in database files.
→ All changes must be recorded on disk to allow the DBMS to find them on restart.
The page table is the mapping from page ids to a copy of the page in buffer pool frames.
→ This is an in-memory data structure that does not need to be stored on disk.
ALLOCATION POLICIES
Global Policies:
→ Make decisions for all active queries.
Local Policies:
→ Allocate frames to specific queries without considering the behavior of concurrent queries.
→ Still need to support sharing pages.
MULTIPLE BUFFER POOLS
The DBMS can maintain multiple buffer pools (e.g., per database or per page type) to reduce latch contention and improve locality.
PRE-FETCHING
The DBMS can prefetch pages based on the query plan: sequential scans and index scans.
SCAN-SHARING
Queries can reuse data retrieved from
storage or operator computations.
If a query wants to scan a table and another query is already doing this, then the DBMS will attach the second query’s cursor to the existing cursor.
BUFFER POOL BYPASS
The sequential scan operator will not store
fetched pages in the buffer pool to avoid
overhead.
OS PAGE CACHE
Most disk operations go through the OS API. Unless the DBMS tells it not to, the OS maintains its own filesystem cache (aka page cache, buffer cache).
BUFFER REPLACEMENT POLICIES
When the DBMS needs to free up a frame to make room for a new page, it must decide which page to evict from the buffer pool.
1.LEAST-RECENTLY USED
2.CLOCK
3.MOST-RECENTLY USED
4.LRU-K
Track the history of last K references to each page as timestamps and compute the interval between subsequent accesses.
5.LOCALIZATION
The DBMS chooses which pages to evict on a per-query basis. This minimizes the pollution of the buffer pool from each query.
→ Keep track of the pages that a query has accessed.
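The LRU-K policy above can be sketched as follows (class and method names are illustrative; real replacers also handle pinned pages):

```python
from collections import defaultdict

class LRUKReplacer:
    """Sketch of LRU-K: evict the page whose K-th most recent access is oldest.
    Pages with fewer than K recorded accesses have infinite backward distance
    and are evicted first (plain LRU among themselves)."""
    def __init__(self, k=2):
        self.k = k
        self.history = defaultdict(list)  # page_id -> timestamps, newest last
        self.clock = 0                    # logical clock instead of wall time

    def record_access(self, page_id):
        self.clock += 1
        h = self.history[page_id]
        h.append(self.clock)
        if len(h) > self.k:
            h.pop(0)  # keep only the last K accesses

    def victim(self):
        def key(pid):
            h = self.history[pid]
            if len(h) < self.k:
                return (0, h[0])  # infinite backward distance bucket
            return (1, h[0])      # h[0] is the K-th most recent access
        return min(self.history, key=key)

r = LRUKReplacer(k=2)
for pid in [1, 2, 1, 3, 1, 2]:
    r.record_access(pid)
print(r.victim())  # 3 -- touched only once, so it is evicted first
```

Tracking the K-th access (instead of just the last one) keeps a one-time sequential scan from flushing frequently reused pages.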
HASH TABLES
HASH FUNCTIONS
We want something that is fast and has a low collision rate.
State of the art: xxHash (from Facebook).
STATIC HASH TABLES
Allocate a giant array that has one slot for every element you need to store.
To find an entry, hash the key and mod by the number of slots to find the offset in the array.
1. Linear Probe Hashing
Single giant table of slots.
Resolve collisions by linearly searching for the next free slot in the table.
Problems arise on deletes; two approaches:
| Ways | How |
|-|-|
| Re-hash | Re-hash and update the slots for all keys. → Nobody actually does this. |
| Tombstone | Set a marker indicating that the entry in the slot is logically deleted. → You can reuse the slot for inserting new keys. → May require periodic garbage collection. |
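A sketch of linear probing with tombstone deletion (fixed-size, no resizing, names are illustrative):

```python
TOMBSTONE = object()  # sentinel marking a logically deleted slot

class LinearProbeTable:
    """Sketch of linear-probe hashing with tombstone deletion."""
    def __init__(self, size=8):
        self.slots = [None] * size

    def _probe(self, key):
        i = hash(key) % len(self.slots)
        for _ in range(len(self.slots)):
            yield i
            i = (i + 1) % len(self.slots)

    def insert(self, key, value):
        for i in self._probe(key):
            entry = self.slots[i]
            # Tombstoned slots may be reused for new keys.
            if entry is None or entry is TOMBSTONE or entry[0] == key:
                self.slots[i] = (key, value)
                return
        raise RuntimeError("table full")

    def get(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:  # truly empty: the key cannot be further along
                return None
            if entry is not TOMBSTONE and entry[0] == key:
                return entry[1]
        return None

    def delete(self, key):
        for i in self._probe(key):
            entry = self.slots[i]
            if entry is None:
                return
            if entry is not TOMBSTONE and entry[0] == key:
                self.slots[i] = TOMBSTONE  # keep the probe chain intact
                return

t = LinearProbeTable()
t.insert("a", 1); t.insert("b", 2)
t.delete("a")
print(t.get("b"))  # 2 -- still reachable even if it probed past "a"'s slot
```

The tombstone is what keeps lookups correct: if `delete` set the slot back to `None`, a later `get` could stop early and miss keys that had probed past it.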
NON-UNIQUE KEYS
→ Choice #1: Separate linked list of values per key.
→ Choice #2: Store duplicate keys redundantly in the table.
2. ROBIN HOOD HASHING
→ Each key tracks the number of positions it is from its optimal position in the table.
→ On insert, a key takes the slot of another key if the first key is farther from its optimal position than the second key.
3. CUCKOO HASHING
Use multiple hash tables with different hash function seeds.
→ On insert, check every table and pick any one that has a free slot.
→ If no table has a free slot, evict an element from one of them and then re-hash it to find a new location.
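A minimal two-table cuckoo sketch (seeds are simulated by salting the key per table; a real implementation would cap the displacement chain and rebuild with new seeds):

```python
class CuckooTable:
    """Sketch of cuckoo hashing with two tables and different hash seeds."""
    def __init__(self, size=8):
        self.tables = [[None] * size, [None] * size]

    def _slot(self, which, key):
        # Different "seeds" simulated by salting the key per table.
        return hash((which, key)) % len(self.tables[which])

    def insert(self, key, value, max_kicks=32):
        entry, which = (key, value), 0
        for _ in range(max_kicks):
            i = self._slot(which, entry[0])
            if self.tables[which][i] is None:
                self.tables[which][i] = entry
                return True
            # Evict the occupant and re-insert it into the other table.
            self.tables[which][i], entry = entry, self.tables[which][i]
            which = 1 - which
        return False  # cycle: a real implementation rebuilds with new seeds

    def get(self, key):
        # Lookups are O(1): each key can only live in one slot per table.
        for which in (0, 1):
            e = self.tables[which][self._slot(which, key)]
            if e is not None and e[0] == key:
                return e[1]
        return None

c = CuckooTable()
for k, v in [("x", 1), ("y", 2), ("z", 3)]:
    c.insert(k, v)
print(c.get("y"))  # 2
```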
DYNAMIC HASH TABLES
Dynamic hash tables resize themselves on demand.
CHAINED HASHING
Maintain a linked list of buckets for each slot in the hash table.
Resolve collisions by placing all elements with the same hash key into the same bucket.
→ To determine whether an element is present, hash to its bucket and scan for it.
→ Insertions and deletions are generalizations of lookups.
→ New entries are appended at the end of the bucket's linked list.
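A minimal chained-hashing sketch (one Python list standing in for each bucket's linked list):

```python
from collections import defaultdict

class ChainedHashTable:
    """Sketch of chained hashing: one bucket (list) per slot."""
    def __init__(self, num_slots=8):
        self.buckets = defaultdict(list)
        self.num_slots = num_slots

    def insert(self, key, value):
        # Append at the end of the bucket's linked list.
        self.buckets[hash(key) % self.num_slots].append((key, value))

    def get(self, key):
        # Hash to the bucket, then scan it.
        for k, v in self.buckets[hash(key) % self.num_slots]:
            if k == key:
                return v
        return None

ct = ChainedHashTable()
ct.insert("a", 1); ct.insert("b", 2)
print(ct.get("a"))  # 1
```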
EXTENDIBLE HASH TABLE
→ Look at some bits of the hash: a global depth for the directory and a local depth per bucket.
→ Split a bucket only when it overflows; if the bucket still has extra space, do not split!
LINEAR HASHING
The hash table maintains a pointer that tracks the next bucket to split.
→ When any bucket overflows, split the bucket at the pointer location (not necessarily the one that overflowed).
→ Use multiple hashes to find the right bucket for a given key: buckets that have already been split (above the split pointer) use the new hash function; buckets not yet split (below it) use the old one.
→ Each round doubles the number of buckets (the round-i hash uses mod 2^i · n), so at any point in time only two hash functions are live.
Can use different overflow criteria:
→ Space Utilization
→ Average Length of Overflow Chains
B TREE
For an M-way B-tree, the number of keys per node is in [⌈M/2⌉−1, M−1], so the number of children is in [⌈M/2⌉, M].
PROPERTIES
1.Key order property
2.Half-full property
Every node other than the root is at least half-full
⌈M/2⌉−1 ≤ #keys ≤ M−1
3.Balance property
It is perfectly balanced (i.e., every leaf node is at the same depth in the tree).
4.Separator keys
A node with t keys has t+1 child pointers; the keys act as separators between subtrees.
Time complexity
query: O(log n)
insert: O(log n)
delete: O(log n)
Two approaches for leaf node data
1. Store Record IDs
2. Store Tuple Data
Insert
1. Find the corresponding leaf node.
2. If the leaf is not full, insert into it.
3. Otherwise, split the node and push one extra middle key up into its parent.
4. Repeat this operation up to the root node if splits propagate.
Deletion
Start at the root, find the leaf L where the entry belongs, and remove the entry.
If L is at least half-full, done!
If L has only ⌈M/2⌉−1 entries:
→ Try to re-distribute, borrowing from a sibling (adjacent node with the same parent as L).
→ If re-distribution fails, merge L and the sibling.
If a merge occurred, must delete the entry (pointing to L or the sibling) from the parent of L.
DUPLICATE KEY METHODS
Approach #1: Append Record ID
→ Add the tuple’s unique Record ID as part of the key to ensure that all keys are unique.
→ The DBMS can still use partial keys to find tuples.
Approach #2: Overflow Leaf Nodes
→ Allow leaf nodes to spill into overflow nodes that contain the duplicate keys.
→ This is more complex to maintain and modify.
B+ TREE DESIGN CHOICES
NODE SIZE
The slower the storage device, the larger the optimal node size for a B+Tree.
→ HDD: ~1MB
→ SSD: ~10KB
→ In-Memory: ~512B
Optimal sizes can vary depending on the workload.
→ Leaf Node Scans vs. Root-to-Leaf Traversals
MERGE THRESHOLD
Some DBMSs do not always merge nodes when they are half full.
Delaying a merge operation may reduce the amount of reorganization.
It may also be better to just let smaller nodes exist and then periodically rebuild entire tree.
VARIABLE-LENGTH KEYS
→ Approach #1: Pointers — store keys as pointers to the tuple's attribute.
→ Approach #2: Variable-Length Nodes.
→ Approach #3: Padding — pad keys to the maximum length.
→ Approach #4: Key Map / Indirection — embed an array of pointers to the key+value pairs within the node.
INTRA-NODE SEARCH
→ Linear: scan the node's keys from beginning to end.
→ Binary: jump to the middle key and pivot left/right.
→ Interpolation: approximate the desired key's location based on the distribution of the stored keys.
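The three search strategies over one sorted node, sketched in Python (the node's key array is made up):

```python
from bisect import bisect_left

# Sorted keys inside a hypothetical B+Tree node.
keys = [10, 20, 30, 40, 50, 60, 70, 80]

def linear_search(keys, target):
    # Scan from the beginning until a key >= target.
    for i, k in enumerate(keys):
        if k >= target:
            return i
    return len(keys)

def binary_search(keys, target):
    # Jump to the middle and pivot; bisect_left does exactly this.
    return bisect_left(keys, target)

def interpolation_search(keys, target):
    # Estimate the position assuming roughly uniform key distribution,
    # then correct with a short local scan.
    lo, hi = keys[0], keys[-1]
    if target <= lo:
        return 0
    pos = min(len(keys) - 1, (target - lo) * (len(keys) - 1) // (hi - lo))
    while pos > 0 and keys[pos - 1] >= target:
        pos -= 1
    while pos < len(keys) and keys[pos] < target:
        pos += 1
    return pos

print(linear_search(keys, 40), binary_search(keys, 40),
      interpolation_search(keys, 40))  # 3 3 3
```

Interpolation only wins when keys really are near-uniform; otherwise its guess degrades to the same scan as the fallback loop.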
LOCKS AND LATCHES
Locks protect the database's logical contents from other transactions; latches protect the DBMS's internal data structures from other threads.
LATCH MODES
→ READ: multiple threads can read at the same time.
→ WRITE: only one thread may hold it; no concurrent READ or WRITE latches.
LATCH IMPLEMENTATION
1. OS MUTEX
Simple to use and Non-scalable
2. READER WRITER LATCHES
→ Allows for concurrent readers. Must manage read/write queues to avoid starvation.
→ Can be implemented on top of spinlocks.
→ Example: std::shared_mutex
HASH TABLE LATCHING
LATCH CRABBING/COUPLING
A protocol that lets multiple threads access/modify a B+Tree at the same time: acquire the child's latch, then release the parent's latch if the child is "safe" (will not split or merge).
BOTTLENECK
Taking a write latch on the root every time becomes a bottleneck with higher concurrency.
BETTER LATCHING ALGORITHM
TOP-K HEAP SORT
When duplicate keys come in, double the heap's size and insert them.
EXTERNAL MERGE SORT
Two-way and general (k-way) variants.
→ k-way means how many sorted runs are merged per pass.
→ With B buffer pages, merge B−1 runs at a time, since one extra buffer page is reserved for writing the output.
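The merge phase can be sketched with heapq.merge, which keeps one "cursor" per sorted run, analogous to one input buffer page per run plus one output page (the runs below are made up):

```python
import heapq

# Sorted runs, as produced by the sort phase (each run fit in memory).
runs = [
    [1, 4, 9],
    [2, 3, 8],
    [5, 6, 7],
]

# k-way merge: streams the runs without materializing them all at once.
merged = list(heapq.merge(*runs))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With B buffer pages a real DBMS would merge at most B−1 runs per pass, repeating passes until one sorted run remains.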
EXTERNAL HASHING AGGREGATION
JOIN ALGORITHMS
Three types: Nested Loop Join, Sort-Merge Join, and Hash Join.
Query Execution
Three approaches
Iterator Model
Each operator implements a Next() function; the DBMS processes one tuple at a time (a.k.a. the Volcano or Pipeline model).
Materialization Model
Each operator processes its entire input at once and then emits its entire output at once.
Better for OLTP workloads because queries only access a small number of tuples at a time.
Not good for OLAP queries with large intermediate results.
Vectorization Model
Like the iterator model, but each Next() emits a batch of tuples instead of a single tuple.
Ideal for OLAP queries because it greatly reduces the number of invocations per operator.
Allows operators to more easily use vectorized (SIMD) instructions to process batches of tuples.
PLAN PROCESSING DIRECTION
→ Top-to-Bottom (pull): start with the root and pull data up from children.
→ Bottom-to-Top (push): start with the leaf nodes and push data up to parents.
ACCESS METHODS
The way the DBMS accesses the data stored in a table. Three ways:
1. Sequential Scan
ZONE MAPS
Pre-computed aggregates (e.g., min/max) for the attribute values in a page. The DBMS checks the zone map first to decide whether it needs to access the page.
2. Index Scan
3. Multi-Index Scan
PARALLEL VS DISTRIBUTED
Parallel DBMSs
PARALLEL
Inter-Query PARALLELISM
Execute multiple disparate queries simultaneously.
→ Increases throughput & reduces latency.
Intra-Query PARALLELISM
Execute the operations of a single query in parallel.
→ Decreases latency for long-running queries,
especially for OLAP queries.
Intra-Operator (Horizontal)
Decompose operators into independent fragments that perform the same function on different subsets of data.
EXCHANGE OPERATOR
Gather: Combine the results from multiple workers into a single output stream.
Distribute: Split a single input stream into multiple output streams.
Repartition: Shuffle multiple input streams across multiple output streams.
Inter-Operator (Vertical)
Operators are overlapped in a pipeline: each operator (e.g., Join, Scan) runs continuously, streaming its results to the next.
Bushy
A combination of the previous two: workers execute multiple operators from different segments of the plan at the same time.
I/O PARALLELISM
→ RAID 0 (striping)
QUERY OPTIMIZATION
LOGICAL PLAN OPTIMIZER
Transform a logical plan into an equivalent logical plan using pattern matching rules.
The goal is to increase the likelihood of enumerating the optimal plan in the search.
Cannot compare plans because there is no cost model but can “direct” a transformation to a preferred side.
Split Conjunctive Predicates
Instead of evaluating one huge predicate, split it into its conjunctive terms.
Predicate Pushdown
Move predicates as close to the data as possible, so as much data as possible is filtered out early.
Replace Cartesian Products with Joins
Rewrite a Cartesian product followed by a selection as an inner join.
Projection Pushdown
Push projections down so that only the needed attributes flow up through the plan.
DECOMPOSING QUERIES
For nested sub-queries, decompose them and run each only once. Some systems (e.g., MySQL) run the sub-query first, then hand the result to the optimizer.
EXPRESSION REWRITING
Traverse the query plan and check whether expressions can be simplified (e.g., constant folding, removing impossible predicates).
COST ESTIMATION
It is too expensive to run every possible plan to determine this information, so the DBMS estimates the cost instead.
STATISTICS
The DBMS stores internal statistics about tables, attributes, and indexes in its internal catalog.
SELECTION CARDINALITY
Equality Predicate: A = constant
Ways to store the statistics:
HISTOGRAMS
Problem: storing an exact count for every single key does not scale as the number of distinct values grows.
Solution: group values into buckets.
→ Equi-width buckets (same key range per bucket) vs. equi-depth buckets (same count per bucket, varying width).
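The two bucketing strategies sketched in Python (toy data; real optimizers store only bucket boundaries and counts):

```python
def equi_width(values, num_buckets):
    """Equal-width histogram: each bucket spans the same key range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1  # avoid zero width if all values equal
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts

def equi_depth(values, num_buckets):
    """Equal-depth histogram: boundaries vary so each bucket holds
    (roughly) the same number of values."""
    s = sorted(values)
    size = len(s) // num_buckets
    return [s[i:i + size] for i in range(0, size * num_buckets, size)]

vals = [1, 1, 2, 3, 4, 8, 9, 9, 9, 10]
print(equi_width(vals, 3))  # [4, 1, 5] -- skew leaves buckets unbalanced
print(equi_depth(vals, 2))  # [[1, 1, 2, 3, 4], [8, 9, 9, 9, 10]]
```

The skewed counts in the equi-width output show why equi-depth buckets often give better selectivity estimates on non-uniform data.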
SKETCHES
Bloom filter
SAMPLING
Collect samples from tables to estimate selectivities.
SINGLE RELATION QUERY PLANNING
choose the optimal indexes
MULTI-RELATION QUERY PLANNING
SYSTEM R OPTIMIZER
Break the query up into blocks and generate the logical operators for each block.
CONCURRENCY CONTROL
MOTIVATION
DEFINITIONS
A txn may carry out many operations on the data retrieved from the database
ACID
ATOMICITY
1: logging
2: Shadow Paging
CONSISTENCY
1: database consistency
2: transaction consistency
ISOLATION
Won’t see any other transaction when executing the current one
Serial Schedule
A schedule that does not interleave the actions of different transactions
Equivalent Schedules
Two schedules are equivalent if they have the same effect on the database.
CONFLICTING OPERATIONS
Read-Write Conflicts
Write-Read Conflicts
Write-Write Conflicts
Conflict Serializability and View Serializability
Conflict Serializability
DEPENDENCY GRAPH
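A schedule is conflict serializable iff its dependency (precedence) graph has no cycles. A small sketch (the schedule encoding is made up for illustration):

```python
def dependency_edges(schedule):
    """schedule: list of (txn, op, obj) with op in {'R', 'W'}, in execution order.
    Add edge Ti -> Tj when an op of Ti conflicts with a LATER op of Tj
    on the same object (at least one of the two ops is a write)."""
    edges = set()
    for idx, (ti, op1, obj1) in enumerate(schedule):
        for tj, op2, obj2 in schedule[idx + 1:]:
            if ti != tj and obj1 == obj2 and "W" in (op1, op2):
                edges.add((ti, tj))
    return edges

def has_cycle(edges):
    # DFS-based cycle check over the txn graph.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set())
    visiting, done = set(), set()
    def dfs(n):
        visiting.add(n)
        for m in graph[n]:
            if m in visiting or (m not in done and dfs(m)):
                return True
        visiting.discard(n)
        done.add(n)
        return False
    return any(dfs(n) for n in graph if n not in done)

# T1 reads A before T2 writes A (T1->T2), but T2 writes B before
# T1 writes B (T2->T1): a cycle, so not conflict serializable.
s = [("T1", "R", "A"), ("T2", "W", "A"), ("T2", "W", "B"), ("T1", "W", "B")]
print(has_cycle(dependency_edges(s)))  # True
```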
View Serializability
A schedule is view serializable if it is view-equivalent to some serial schedule (same reads-from relationships and same final writes). This admits some schedules with conflicts that conflict serializability rejects.
TWO-PHASE LOCKING (2PL)
→ Phase #1 (Growing): acquire locks; never release.
→ Phase #2 (Shrinking): release locks; never acquire new ones.
Cascading Aborts
If a txn aborts, every txn that read its uncommitted writes must also abort. This is a performance issue, not a correctness one.
STRONG STRICT TWO-PHASE LOCKING
The txn holds all its locks until it commits or aborts, which avoids cascading aborts.
types
Deadlock Detection and Prevention
→ Detection: build a waits-for graph in the background, detect cycles, and pick a victim txn to abort.
→ Prevention: instead of using a background thread to detect deadlocks, decide whether granting a lock could cause a deadlock at the moment the lock is requested.
Granularity
Intention Locks
Signal the intent to acquire explicit locks at a finer granularity lower in the hierarchy (IS, IX, SIX).
TIMESTAMP ORDERING (T/O)
An optimistic approach: use timestamps to determine the serializability order of txns.
Basic T/O
OPTIMISTIC CONCURRENCY CONTROL(OCC)
OCC PHASES
VALIDATION PHASE
STEP I
STEP II
STEP III
PHANTOM PROBLEM SOLUTIONS
ISOLATION LEVELS
MVCC
MVCC VERSION STORAGE
garbage collection
LOGGING
STORAGE TYPE
FAILURE
transaction failure: logical errors or internal state errors in a txn.
system failure: software or hardware crashes; non-volatile storage survives.
storage media failure: non-repairable hardware failure; restore the database from an archive.
redo and undo log
STEAL POLICY
FORCE POLICY
SHADOW PAGING
WAL
LOGGING SCHEMA
CHECKPOINT
RECOVERY
ARIES:Algorithm for Recovery and Isolation Exploiting Semantics