Oracle’s introduction of AI Vector Search in Database 23ai brings native support for vector data types, making it easier to store, index, and query embeddings. This is especially useful for applications like retrieval-augmented generation (RAG), recommendation systems, anomaly detection, and AI-powered search. Since vector search is becoming essential in AI and LLM-based solutions, gaining expertise in this area can be valuable.

Oracle is offering its ‘Become an Oracle AI Vector Search Professional‘ learning path and ‘Oracle AI Vector Search Professional Certification (Exam 1Z0-184-25)‘  for free until May 15, 2025, making it a good chance to learn and get certified at no cost and saving the certification exam fee of $245.

What are vectors.

A vector is simply an ordered list of numbers. It can be thought of as a point in space or an arrow with a specific direction and magnitude. In mathematics and machine learning, vectors are widely used to represent data in a structured way.

In natural language processing (NLP) and machine learning, words, sentences, or even entire documents can be converted into vectors. This transformation allows computers to understand relationships between words based on their meanings.

For example:

  • The word “king” might be represented as the vector [0.8, 0.2, 0.9],
  • The word “queen” as [0.8, 0.3, 0.9].

Although we used three-dimensional vectors in the previous example, LLMs typically work with a huge number of dimensions. For simplicity, we will use two- or three-dimensional vectors. As shown above, words with similar meanings will have similar vector representations, allowing models to capture semantic relationships. The process of transforming chunks of text into vectors is known as embedding, and many embedding models are available.

Vector Datatype in Oracle 23ai

In Oracle Database 23ai, a vector datatype is a new data type designed to store and process vector embeddings, which are numerical representations of data points used in machine learning and AI applications. These embedings help capture relationships between data elements, enabling efficient similarity searches and AI-driven queries.

Oracle 23ai allows storing, indexing, and querying vector data natively, eliminating the need for external vector databases. This improves performance and integration, especially for AI-powered applications like semantic search, recommendation systems, and anomaly detection.

-- Vector Constructor
select vector('[2, -.5]');
select vector('[2, -.5]', 2, FLOAT32);

-- Vector columns
create table tv1 (id number, v vector(2, float32));
insert into tv1 values (1, '[2,-1]')
                     , (2, '[-1,-1]')
                     , (3, '[5,-5]')
                     , (4, '[5, 3]')
                     , (5, '[2,-2]')
                     , (6, '[4,-3]')
                     , (7, '[0,-3]');
commit;    
                 
select * from tv1;

A vector can be NULL but not null dimension is allowed. ie: [1.1, NULL, 2.2] is not allowed.

The statement select vector('[2, -.5]', 2, FLOAT32); creates a 2-dimensional vector with values 2 and -0.5, using the FLOAT32 data type. The vector constructor takes a string representation of the vector, its dimension, and an optional data type. Although dimensions and type are optional, if you specify them they must concord.

Oracle supports vectors up to 65,535 dimensions.

A table can have multiple vector columns, each storing vectors of different dimensions or data types, alongside traditional relational columns like integers, strings, or dates. This allows for powerful hybrid data models where structured relational data is combined with vector-based representations, enabling efficient similarity searches, machine learning applications, and advanced analytics within the same table.

Oracle supports the INT8 (1 byte), FLOAT32 (4 bytes), FLOAT64(8 bytes) types.

To calculate the space in bytes for a vector with ‘n’ dimensions, where each dimension is of type FLOAT32, you multiply the number of dimensions by the size of each dimension in bytes. Since FLOAT32 uses 4 bytes per dimension.

The formula is: Total space (bytes) = n (dimensions) * 4 (bytes/dimension).

For example, if you have a 768-dimensional vector with each dimension as FLOAT32, the total space in bytes would be 768 * 4 = 3072 bytes.

Vectors Distance

Vector distance measures how similar or different two vectors are by quantifying the space between them. It is a fundamental concept in machine learning, search engines, and recommendation systems. There are multiple ways to calculate this distance, each suited for different use cases. Common metrics include Euclidean distance, which measures straight-line distance, cosine similarity, which captures angular difference, and Manhattan distance, which sums the absolute differences between coordinates. Choosing the right metric depends on the nature of the data and the problem being solved.

select VECTOR_DISTANCE( vector('[5,5]'), vector('[6,6]'), EUCLIDEAN);

The function VECTOR_DISTANCE returns the distance between two vectors using the specified metric. Other possible metrics include:

  • COSINE (default): Measures the angle between two vectors; values range from -1 (opposite) to 1 (identical).
  • DOT: Computes the dot product of two vectors, useful for ranking similarity in high-dimensional spaces.
  • EUCLIDEAN: Calculates the straight-line (L2) distance between two points in space.
  • EUCLIDEAN_SQUARED: Same as Euclidean but without the square root, making it computationally cheaper.
  • HAMMING: Counts the number of positions where two binary vectors differ.
  • MANHATTAN: Measures the sum of absolute differences between vector coordinates (L1 distance).
  • JACCARD: Compares the similarity between two sets by dividing the intersection size by the union size.

Also the following Shorthand Vector Distance functions and Shorthand Operators can be used:

  • Shorthand Vector Distance functions:
    • L1_DISTANCE equal to VECTOR_DISTANCE(v1, v2, MANHATTAN)
    • L2_DISTANCE equal to VECTOR_DISTANCE(v1, v2, EUCLIDEAN)
    • COSINE_DISTANCE equal to VECTOR_DISTANCE(v1, v2, COSINE)
    • INNER_PRODUCT equal to VECTOR_DISTANCE(v1, v2, DOT)
    • HAMMING_DISTANCE equal to VECTOR_DISTANCE(v1, v2, HAMMING)
    • JACCARD_DISTANCE equal to VECTOR_DISTANCE(v1, v2, JACCARD)
  • Shorthand Operators for Distances
    • <-> is the Euclidean distance operator
    • <=> is the cosine distance operator
    • <#> is the negative dot product operator
-- The following queries are similar


-- Using Vector Distance
select VECTOR_DISTANCE( vector('[5,5]'), vector('[6,6]'), EUCLIDEAN);

-- Using Shorthand Vector Distance
select L2_DISTANCE( vector('[5,5]'), vector('[6,6]'));

-- Using Shorthand Operator
select vector('[5,5]') <-> vector('[6,6]'));
    /* Metrics, shorthand vector distance functions and Shorthand Operators

    COSINE, COSINE_DISTANCE or <=> operator (default): Measures the angle between two vectors; values range from -1 (opposite) to 1 (identical).
    DOT and INNER_PRODUCT: Computes the dot product of two vectors, useful for ranking similarity in high-dimensional spaces.
                           <#> operator is the negative dot product operator: expr1 <#> expr2 is equivalent to -1*INNER_PRODUCT(expr1, expr2).
    EUCLIDEAN, L2_DISTANCE or <-> operator: Calculates the straight-line (L2) distance between two points in space. 
    EUCLIDEAN_SQUARED: Same as Euclidean but without the square root, making it computationally cheaper.
    HAMMING and HAMMING_DISTANCE: Counts the number of positions where two binary vectors differ.
    MANHATTAN and L2_DISTANCE: Measures the sum of absolute differences between vector coordinates (L1 distance).
    JACCARD and JACCARD_DISTANCE: Compares the similarity between two sets by dividing the intersection size by the union size.
    */

Vector Queries and search types

Exact Similarity Search, also known as Flat Search (shown above), computes the exact distance between a query vector and every vector in the database. This method guarantees precise results but can be slow, especially when dealing with large datasets, as it requires a full-scan comparison of all vectors in the table.

-- Vectors Search

  -- Exact Similary Search aka Flat search: Provides most accurate results but takes longer.
    select id, v from tv1
    order by VECTOR_DISTANCE( vector('[3, 3]'), v, EUCLIDEAN)
    FETCH FIRST 3 ROWS ONLY;

    -- equal to --
    select id, v from tv1
    order by L2_DISTANCE( vector('[3, 3]'), v)
    FETCH FIRST 3 ROWS ONLY;

    -- equal to --
    select id, v from tv1
    order by vector('[3, 3]') <-> v
    FETCH FIRST 3 ROWS ONLY;

On the other hand, Approximate Similarity Search speeds up the process by using indexes that approximate the closest vectors, reducing computation time at the cost of some accuracy. This method is particularly useful in large-scale datasets where perfect accuracy isn’t crucial. This type of search is beyond the scope of this post.

Finally, Multi-Vector Similarity Search allows for querying with multiple vectors at once, rather than just a single query vector. This approach computes similarity based on a combination of vectors, which can enhance the accuracy of the search, especially in scenarios with multi-modal or complex data. It’s valuable when the context of multiple vectors needs to be considered together, improving results for applications like recommendation systems or complex AI models. This type of search is beyond the scope of this post.

Other vector functions

These other functions help with vector creation, conversion, and querying the properties of vectors.

-- Other Vector Functions

  -- create a vector from a string
  select to_vector('[2, -.5]', 2, FLOAT32);
  select vector('[2, -.5]', 2, FLOAT32);

  -- Convert a vector to a string or CLOB
  select from_vector(vector('[2, -.5]', 2, FLOAT32));
  select vector_serialize(vector('[2, -.5]', 2, FLOAT32));
  
  -- Return the distance from the origin '[0, 0]'
  select vector_norm(vector('[2, -.5]', 2, FLOAT32));
  
  -- Return the number of dimensions of the vector
  select vector_dimension_count(vector('[2, -.5]', 2, FLOAT32));
  
  -- Return the format of the vector
  select vector_dimension_format(vector('[2, -.5]', 2, FLOAT32));
  select vector_dimension_format(vector('[2, -.5]')); -- Default format is FLOAT32

The functions to_vector and vector create vectors from a string representation. These functions take in the vector elements as a string and specify the dimensionality and format (e.g., FLOAT32). To convert a vector back into a string or CLOB, you can use from_vector or vector_serialize, which return a string representation of the vector’s content. The function vector_norm computes the distance of a vector from the origin, ‘[0, 0]’, which is useful for determining its magnitude. Additionally, vector_dimension_count returns the number of dimensions of a vector, while vector_dimension_format returns the data type format of the vector, such as FLOAT32. If the format is unspecified, the default is FLOAT32.

Reference

Leave a Reply

Discover more from DB-Master

Subscribe now to keep reading and get access to the full archive.

Continue reading