8. 11. 2023
5 min read
How to identify potential outlier queries in your SQL database
Efficient databases hinge on swift, precise SQL queries, and SQL query performance testing is your tuning tool for achieving this. With it, you simulate real-world scenarios to uncover issues before they impact user experience or system stability.
Kristián Müller
What is it?
SQL query performance testing is crucial to keep our databases operating efficiently. As we deal with larger and more intricate databases, it's essential to ensure that our SQL queries run swiftly. Performance testing is like fine-tuning a machine to get the best out of it. Just as a well-maintained engine leads to better fuel efficiency and a smoother ride, optimizing SQL queries improves the overall performance of our database systems.
By testing our SQL queries under various conditions, we can pinpoint and address any issues that could slow things down. It's similar to stress-testing a vehicle in different weather and terrain conditions to ensure it performs reliably in any situation. Similarly, with SQL queries, we simulate real-world scenarios and heavy workloads to identify bottlenecks, inefficient query structures, or missing indexes. This proactive approach helps us rectify these issues before they impact user experience or system stability.
Why should I do it?
There are many reasons to run regular performance tests on the SQL queries you use, but these are the key ones:
Early Issue Detection: It helps uncover bottlenecks, memory leaks, or coding errors early in the development cycle, reducing the cost of fixing issues later.
Scalability Planning: Performance testing helps you plan for future growth by understanding how your system handles increased loads.
Resource Allocation: Performance testing helps allocate resources effectively, ensuring that critical tasks get the necessary computing power.
Cost Optimization: Efficient applications require fewer server resources, reducing hosting and infrastructure costs.
User Satisfaction: Performance testing ensures that your application responds quickly and reliably to user interactions, leading to improved user satisfaction and retention.
How should I do it?
Define Objectives and Metrics:
Begin by understanding the specific goals and objectives of the performance testing. What are you trying to achieve? Define clear, measurable performance metrics like response time, throughput, and resource utilization.
Identify Test Environment:
Set up a dedicated test environment that closely resembles your production environment. This includes hardware, software, network conditions, and databases.
Create Test Data:
Generate or prepare test data that reflects the diversity and volume of data your system will eventually need to handle in real usage.
Choose Testing Tools:
Select performance testing tools that suit your needs. For complex testing you can reach for an established tool, or you can write your own utilities that measure the objectives you defined earlier.
Execute Tests:
Run performance tests according to your defined test cases and scenarios. This should include tests for load, stress, scalability, and/or endurance.
Analyze Results:
Analyze the test results to identify performance bottlenecks, errors, and areas that need improvement. This may involve pinpointing slow database queries, inefficient code, or infrastructure limitations (see the EXPLAIN ANALYZE sketch after these steps).
Optimize and Retest:
Optimize the code as needed and re-run the tests to confirm improvements.
Report and Documentation:
Prepare a report that summarizes the test results, identified issues, and any actions taken for optimization.
Automation:
Consider automating performance tests to run them regularly as part of your continuous integration and delivery (CI/CD) pipeline.
Iterate:
Performance testing should be an iterative process, with regular testing cycles as the application evolves and user loads change.
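A practical companion to the Analyze Results and Optimize and Retest steps in PostgreSQL is EXPLAIN (ANALYZE, BUFFERS), which executes a query once and reports the actual plan, row counts, timings, and buffer usage. Below is a minimal sketch, assuming the users table used later in this article; the index name is purely illustrative.

-- Run the query once and show the real execution plan with timings and buffer usage.
-- A sequential scan on a large table here usually hints at a missing index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE name = 'TEST';

-- If the plan shows a sequential scan on the filtered column, an index is a
-- common first optimization before re-running the benchmark.
CREATE INDEX IF NOT EXISTS users_name_idx ON users (name);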
Custom solution without any 3rd party service
First, let's examine the folder structure. This might seem unimportant at first glance, but it's crucial to follow this structure because everything relies on it. You'll gain a better understanding of its importance as you read further.
The predefined function benchmark.sql, located in the /performance-tests/functions/ directory, contains both the definition of the benchmarking function and its invocation.
In addition to the function, our primary concern is to create a folder named after the query we intend to test inside the /performance-tests/ directory, together with three files in it: result.md, seed.sql and test.sql.
IMPORTANT: The query folder name should follow the camelCase naming convention.
The contents of all files can be found further below.
/api
├── ...
├── /performance-tests
│   ├── /functions
│   │   └── benchmark.sql   # Benchmark function for performance testing
│   └── /queryName          # Folder named after our query (camelCase naming convention)
│       ├── result.md
│       ├── seed.sql
│       └── test.sql
└── ...
Content of benchmark.sql:
-- Define a benchmarking function for SQL queries
CREATE OR REPLACE FUNCTION bench(query TEXT, iterations INTEGER = 1000)
RETURNS TABLE (avg FLOAT, min FLOAT, q1 FLOAT, median FLOAT, q3 FLOAT, p95 FLOAT, max FLOAT) AS $$
DECLARE
  _start TIMESTAMPTZ;       -- Timestamp to record query start time
  _end TIMESTAMPTZ;         -- Timestamp to record query end time
  _delta DOUBLE PRECISION;  -- Elapsed time in milliseconds
BEGIN
  -- Create a temporary table to store query execution times
  CREATE TEMP TABLE IF NOT EXISTS _bench_results (
    elapsed DOUBLE PRECISION
  );

  -- Warm the cache by executing the query 5 times
  FOR i IN 1..5 LOOP
    EXECUTE query;
  END LOOP;

  -- Run the query for the specified number of iterations and record elapsed times
  FOR i IN 1..iterations LOOP
    _start = clock_timestamp();  -- Record start time
    EXECUTE query;               -- Execute the query
    _end = clock_timestamp();    -- Record end time
    _delta = 1000 * (extract(epoch FROM _end) - extract(epoch FROM _start));  -- Calculate elapsed time in milliseconds
    INSERT INTO _bench_results VALUES (_delta);  -- Store elapsed time in the temporary table
  END LOOP;

  -- Return performance metrics based on the recorded data
  RETURN QUERY SELECT
    ROUND(AVG(elapsed)::NUMERIC, 2)::FLOAT,                                             -- Average elapsed time
    ROUND(MIN(elapsed)::NUMERIC, 2)::FLOAT,                                             -- Minimum elapsed time
    ROUND((PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY elapsed))::NUMERIC, 2)::FLOAT,  -- 25th percentile (Q1)
    ROUND((PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY elapsed))::NUMERIC, 2)::FLOAT,   -- Median (50th percentile)
    ROUND((PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY elapsed))::NUMERIC, 2)::FLOAT,  -- 75th percentile (Q3)
    ROUND((PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY elapsed))::NUMERIC, 2)::FLOAT,  -- 95th percentile (p95)
    ROUND(MAX(elapsed)::NUMERIC, 2)::FLOAT                                              -- Maximum elapsed time
  FROM _bench_results;

  -- Clean up: drop the temporary table
  DROP TABLE IF EXISTS _bench_results;
END
$$ LANGUAGE plpgsql;  -- Define the language for this function as PL/pgSQL
SELECT * FROM bench(:'query');  -- Utilization of the bench function with the 'query' variable (read further for more information)
The output of the bench function provides valuable insights into the performance of your SQL query. Here's what you can inspect and why it's important:
Average Elapsed Time: The average time it takes to execute the query over the specified number of iterations. This gives you a sense of the query's typical performance.
Minimum Elapsed Time: The quickest execution time among all iterations. This helps identify best-case performance.
Quartiles (Q1, Median, Q3): Quartiles divide the data into four equal parts. Q1 represents the 25th percentile, Median the 50th percentile, and Q3 the 75th percentile of elapsed times. These metrics provide insights into the distribution of query execution times.
95th Percentile: The elapsed time below which 95% of query executions fall. It helps identify outliers and potential performance bottlenecks.
Maximum Elapsed Time: The longest execution time among all iterations. This helps identify worst-case performance.
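If you want to experiment interactively instead of going through the shell script described below, the bench function can also be called directly in psql with a query string and an optional iteration count. A small sketch, using the example query and users table introduced later in this article:

-- Benchmark the example query for 500 iterations instead of the default 1000.
-- Note the doubled single quotes inside the query string literal.
SELECT * FROM bench('SELECT * FROM users WHERE name = ''TEST''', 500);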
Into the seed.sql file we created earlier, we need to write our raw SQL seeds for performance testing. For optimal results, we should generate between 10x and 1,000x the number of rows we currently store in our database.
INSERT INTO users (id, name)
SELECT
  ROW_NUMBER() OVER (ORDER BY GENERATE_SERIES),  -- id
  'TEST'                                         -- name
FROM GENERATE_SERIES(1, 10000);                  -- number of rows to insert
If you want to create n rows, you can use the GENERATE_SERIES(1, n) function, as in the example above. (More information about this and other set-returning functions can be found in the PostgreSQL documentation.)
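If identical values would make the test unrealistically easy for the query planner, the seed can also generate varied data. A sketch, assuming the same users(id, name) table, where md5(random()) is just one convenient way to produce distinct names:

-- Insert 100,000 rows with distinct names so selectivity is closer to production data.
INSERT INTO users (id, name)
SELECT
  gs,                              -- id
  'user_' || md5(random()::TEXT)   -- name, e.g. 'user_9f86d081...'
FROM GENERATE_SERIES(1, 100000) AS gs;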
Last but not least, we need to paste or (re)write the query that we want to test into the test.sql file.
SELECT * FROM users WHERE name = 'TEST';
In reality, you will have larger and more complex queries as well as larger and more complex seeds. The examples above are just for illustration purposes.
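For a taste of what a more involved test.sql might look like, here is a sketch; the orders table and its columns are hypothetical and would need corresponding seeds in seed.sql:

-- Hypothetical example only: assumes an orders(id, user_id) table seeded in seed.sql.
-- Joins and aggregates like this tend to surface missing indexes more readily
-- than a single-table lookup.
SELECT u.id, u.name, COUNT(o.id) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.name = 'TEST'
GROUP BY u.id, u.name;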
After all the steps described above, the last thing we need to do is run the command:
$ npm run test:performance -- <queryName>
To be able to run this shell script, the scripts section in the package.json file needs to be extended as follows:
{ "scripts": { "test:performance": "sh scripts/../index.sh" }}
The content of the shell script can be found below.
The shell script takes care of a few things:
Makes sure that the folder for the specific performance test exists
Makes sure that seed.sql and test.sql are defined in the performance test folder
Executes the seed(s) and the benchmark of the tested query in a transaction against the database running in a Docker container
#!/bin/bash

# Check if queryName argument is provided
if [ "$#" -ne 1 ]; then
  echo "Usage: npm run test:performance -- <queryName>"
  exit 1
fi

# Assign the argument to queryName
queryName="$1"

# Check if the specified query directory exists
if [ ! -d "./performance-tests/${queryName}" ]; then
  echo "Error: Directory './performance-tests/${queryName}' does not exist."
  exit 1
fi

# Check if the specified query seeds exist in the directory
if [ ! -f "./performance-tests/${queryName}/seed.sql" ]; then
  echo "Error: File 'seed.sql' in directory './performance-tests/${queryName}' does not exist."
  exit 1
fi

# Check if the specified query exists in the directory
if [ ! -f "./performance-tests/${queryName}/test.sql" ]; then
  echo "Error: File 'test.sql' in directory './performance-tests/${queryName}' does not exist."
  exit 1
fi

# Read the content of seed.sql and benchmark.sql into variables
seedContent="$(< "./performance-tests/${queryName}/seed.sql")"
benchmarkContent="$(< "./performance-tests/functions/benchmark.sql")"

# Execute the merged SQL wrapped in a transaction using psql in Docker
echo "BEGIN; $seedContent $benchmarkContent ROLLBACK;" | docker exec -i "YOUR_CONTAINER_NAME" psql -U postgres -d "YOUR_DATABASE_NAME" -v query="$(< "./performance-tests/${queryName}/test.sql")"
After running the shell script, a similar output should be visible in the terminal:
BEGIN
INSERT 0 10000
CREATE FUNCTION
 avg | min  |  q1  | median |  q3  | p95  | max
-----+------+------+--------+------+------+-----
 0.6 | 0.54 | 0.56 |   0.62 | 0.62 | 0.68 | 0.7
(1 row)
ROLLBACK
If everything looks healthy and doesn't need any refactoring, the results are stored in the result.md file.
## Performance Results for 10k Users
The following table displays the performance results for a test with 10,000 users:
| Metric    | Average | Minimum | Q1     | Median | Q3     | P95    | Maximum |
| :-------- | :------ | :------ | :----- | :----- | :----- | :----- | :------ |
| Time (ms) | 0.6     | 0.54    | 0.56   | 0.62   | 0.62   | 0.68   | 0.7     |

These results indicate the elapsed time (in milliseconds) for executing the query
on 10,000 users. The average time is 0.6 ms, with the minimum time being
0.54 ms and the maximum time reaching 0.7 ms. The median time
(50th percentile) is 0.62 ms, and the first quartile (25th percentile) and
third quartile (75th percentile) values are 0.56 ms and 0.62 ms, respectively.
The 95th percentile time (P95) is 0.68 ms.
PostgreSQL function languages
PostgreSQL supports various procedural languages for writing functions. The LANGUAGE clause specifies the language used for the given function. In the case above, we are using PL/pgSQL.
You can explore more about PostgreSQL function languages in the official documentation.
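For comparison, a function that needs no loops or variables can be written directly in the SQL language instead of PL/pgSQL. A minimal sketch, assuming the users table from the earlier examples; the function name is purely illustrative:

-- A simple set-based helper wrapped in a plain SQL-language function.
CREATE OR REPLACE FUNCTION count_test_users()
RETURNS BIGINT AS $$
  SELECT COUNT(*) FROM users WHERE name = 'TEST';
$$ LANGUAGE sql;

SELECT count_test_users();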
Now you have the tools to fine-tune your SQL queries
In conclusion, SQL query performance testing is the key to getting the most out of your database. By proactively identifying and addressing bottlenecks, inefficient queries, and resource allocation issues, you can help ensure applications respond swiftly and reliably. With the custom solution we've outlined, you have the tools to fine-tune your SQL queries and optimize your database's performance.