Indexing decisions: hashing out and choosing among unique options
By Scott Nelson
Oracle offers a number of different types of indexes that can be created to provide a means to expedite access to data stored in the database. It is up to the DBA or Designer to determine which type of index should be implemented to best satisfy the performance requirements and provide the optimal access paths to the data.
Well-designed indexes avoid full table scans thereby decreasing access times. Unique indexes insure the uniqueness of indexed attributes. Certain index types offer direct support for joins while certain index types support accessing table rows in a particular order. In general, indexes are the best and easiest mechanism for optimizing a database.
However, poorly designed indexes may cause query execution to take longer than performing full table scans and they may require a large amount of disk space. Additionally having to many indexes or poorly chosen indexes on a table may adversely affect update operations.
B+ Tree Indexes
These are the most common indexes because they provide performance and functionality that benefits the widest variety of applications and data access methods. The full explanation of the storage structure of a B-tree index is beyond the scope of this paper, but briefly B-tree indexes are stored in a tree structure that has branch blocks which point to lower-level blocks. The lower level index blocks are called leaf blocks that contain the indexed data value and a corresponding ROWID. The ROWID is a pointer that points directly back to the row in the data table.
One of the principal performance benefits of indexes is realized when an index is accessed and the data needed to satisfy the query resides in the index there is no need access the table. With an index organized table the index can serve as the source of data, rather than the table. A B-tree index can dramatically improve response times when an application or user selects data where:
- A where clause value equals a specific key value
- Index values are unique or close to unique
- Rows are within a range of values
B-tree indexes improve the performance of queries that select a small percentage of rows from a table. Typically if more than 15% of the data will be returned an index should not be used as the performace hit of accessing the index and the table exceed accessing the entire table (full table scan).
In a nutshell:
- Structure is an inverted tree with one root node and several interior and leaf nodes.
- B+ trees are balanced in that all leaf nodes are equidistant from the root node.
All nodes have a similar structure containing indexed values and pointers to other index nodes or data blocks. (fig 1)
- All root and interior pointers reference other index nodes.
- All leaf pointers reference data blocks except for the last which points to the next sequential index block.
- Depth of tree dynamically grows and shrinks with data.
- Depth of tree grows and shrinks at much slower rate than data.
- Top few levels of the tree may be cached.
- Leaf node values are dense having a value-pointer combination for every data value in the index.
- The value slots in the root and interior nodes are compacted.
- Completely NULL values are not stored but composite indexed attributes with at least one non-NULL attribute value are stored.
B+ Tree Indexes have many advantages and no severe disadvantages. The disadvantages are disk space usage and small performance degradation with large table growth. They have been in practical use for years and are well understood. Except for a few notable cases, they should be the index of choice. They support fast random access on all or portions of the indexed attributes; accessing one data row requires at most the depth of the B+ tree plus one accesses. They support sequential access, range queries and sorting on all or a portion of the indexed attributes. They efficiently check for the existence or nonexistence of data records having specific values for the indexed attributes and, in some cases, can completely answer a query.
Clustering refers to the practice of grouping one or more rows from possibly more than one table in the same data block. The grouping is accomplished via attributes which all the tables share in common; these attributes are known as the cluster key. Records with a common cluster key value are stored in the same block(s). A cluster is one or more tables that share the same data blocks because they share common columns and are often used together.
There are two primary benefits of clustering data:
- Disk I/O is reduced and access time improves for joins of clustered tables.
- Disk storage is reduced. Each cluster key value is stored only once each in the cluster and the cluster index, no matter how many rows of different tables contain the value.
There are two different types of clusters: clustered by the value of one or more attributes or clustered by the value of a function applied to one or more attributes.
- First type, indexed clustering, requires B+ tree index on the cluster key.
- Second type, hash clustering, is also known as hashing and requires no auxiliary index structure.
A clustered index is a sparse B+ tree index on the distinct values of the cluster key. Clustering offers fast access for rows of one or more tables which are frequently accessed together, e.g., joins, but not inserted together.. Use indexed clusters for critical transactions where several rows of one or more tables are accessed together using equality on the full cluster key and those rows will not generally be in the same block, A side benefit is that a cluster may conserve space as the cluster key is not repeated.
For data warehouse applications hash clusters can improve the performance of data retrieval. A hash function is used to generate a distribution of numeric values called hash values, which are based on specific cluster key values. To find or store a row in a hash cluster, the hash function is applied to the row’s cluster key value. The hash value returned corresponds to a data block in the cluster, which is read or written to on behalf of the issued statement.
If all goes well, clusters guarantee an access time almost as fast as rowid access (1 block read). A single hash bucket is devoted to the hashed value of fully NULL keys should any exist. One of the main problem with hash clusters is that the number of clusters is fixed for the lifetime of the cluster in that the number of hash buckets cannot grow beyond those initially allocated. Therefore tables in a hash cluster should be relatively static or at least possess a bounded growth pattern.
If the number of hash clusters is too small, collisions will occur and performance will degrade. Also the unanticipated insertions to tables that are members of a hash cluster may cause the creation of overflow buckets which is a form of chaining. Another limitation is that the hash function can only be used for equality conditions on the full cluster key — not range queries, access via partial key, etc.
Hashed Index Evaluation
Use hash clusters for critical transactions where several rows of one or more tables are accessed together using equality on the full cluster key and those rows will generally not be in the same block OR need fast access on a single row of a table via equality on a unique key.
The ideal candidate for hash cluster organization is information that could be represented as a fixed sized table and is frequently accessed on its full primary key. Another candidate is a constant-sized or slowly growing table whose numeric primary key is frequently used as a foreign key in other tables.
A B-tree index is most effective for high-cardinality data. A bitmapped index is most appropriate and effective for low-cardinality data, where a column has relatively few distinct values. Bitmapped indexing provides both performance benefits and storage savings.
When a bitmap index is created on a column, a bit stream (ones and zeros) is created for each distinct value in the indexed column. A one or zero represents whether or not a record in the table is equal to the bit-streamed value.
Bitmap indexes are an efficient representation of an inverted list index structure. (Fig 2) where each table row is represented by a corresponding bit.
Bitmap Indexes support fast random access on indexed attributes as well as inequality and range queries. They efficiently check for the existence or nonexistence of data records having specific values for the indexed attributes and, in some cases, can completely answer an aggregate query. They outperform B+ tree indexes when there are a large number of rows and #rows/#distinct index values. They may also outperform B+ tree indexes when there are a large number of rows and #rows/#distinct index values < 100 if most queries have complex WHERE clauses.
While bitmap indexes are not for most OLTP applications they are ideal for data warehousing databases. They are very effective for use in an environment where there are no concurrent DML operations, the indexed table has a large number of rows, and #rows/#distinct index values >= 100 OR most queries which might use the index have a complex WHERE clause.
In a B-tree index, the table data is stored in a database object, and the index data, the B-tree, is stored
in another database object. The index data object includes the indexed column values and a ROWID
pointing into the data table. In other words, the indexed values are stored twice, once in the table, and
once in the index.
Usually the indexed columns are a small part of the overall data stored in a row resulting in an index that is smaller than the table. Sometimes an index is comprised of a majority of the columns of the table. This results in the redundant storage of the data. In this case the table would benefit from an index-organized table
In an index-organized table, only one database object is used to store both the data and the index there is no database object used to only store the data. The index structure holds both the indexed column values and the remaining column values.
To implement an index-organized table it is created using the INDEX ORGANIZED option. Access to the index-organized table is just like a regular table, using standard SQL statements for query, insert, update, or delete. The catch to index-organized tables is that no secondary indexes can be created on the table and an updated row may force the entire row to be moved.
Because all data, the key columns and the remaining columns, is stored in the B-tree structure, index-organized tables provide a faster, key-based access to table data for queries involving exact match and/or range search. Once the search has located the key values, the remaining data is present at that location, so there is no need to follow a ROWID back to the table data, as would be the case with a table and B-tree index structure. This eliminates one I/O, namely the read of the table.
Filed under: Technical Tips
Like this post? Subscribe to my RSS feed and get loads more!