8. 11. 2023
5 min read
How to identify potential outlier queries in your SQL database
Efficient databases hinge on swift, precise SQL queries, and SQL query performance testing is your tuning tool for achieving this. With it, you simulate real-world scenarios to uncover issues before they impact user experience or system stability.
Kristián Müller
What is it?
SQL query performance testing is crucial to keep our databases operating efficiently. As we deal with larger and more intricate databases, it's essential to ensure that our SQL queries run swiftly. Performance testing is like fine-tuning a machine to get the best out of it. Just as a well-maintained engine leads to better fuel efficiency and a smoother ride, optimizing SQL queries improves the overall performance of our database systems.
By testing our SQL queries under various conditions, we can pinpoint and address any issues that could slow things down. It's similar to stress-testing a vehicle in different weather and terrain conditions to ensure it performs reliably in any situation. Similarly, with SQL queries, we simulate real-world scenarios and heavy workloads to identify bottlenecks, inefficient query structures, or missing indexes. This proactive approach helps us rectify these issues before they impact user experience or system stability.
Why should I do it?
There are many reasons to run regular performance tests on the SQL queries you use, but these are the key ones:
Early Issue Detection: It helps uncover bottlenecks, memory leaks, or coding errors early in the development cycle, reducing the cost of fixing issues later.
Scalability Planning: Performance testing helps you plan for future growth by understanding how your system handles increased loads.
Resource Allocation: Performance testing helps allocate resources effectively, ensuring that critical tasks get the necessary computing power.
Cost Optimization: Efficient applications require fewer server resources, reducing hosting and infrastructure costs.
User Satisfaction: Performance testing ensures that your application responds quickly and reliably to user interactions, leading to improved user satisfaction and retention.
How should I do it?
Define Objectives and Metrics:
Begin by understanding the specific goals and objectives of the performance testing. What are you trying to achieve? Define clear, measurable performance metrics like response time, throughput, and resource utilization.
Identify Test Environment:
Set up a dedicated test environment that closely resembles your production environment. This includes hardware, software, network conditions, and databases.
Create Test Data:
Generate or prepare test data that reflects the diversity and volume of data your system will eventually need to handle in real usage.
Choose Testing Tools:
Select performance testing tools that suit your needs. For complex testing you can reach for an established tool, or you can write your own utilities that measure the objectives you defined earlier.
Execute Tests:
Run performance tests according to your defined test cases and scenarios. This should include tests for load, stress, scalability, and/or endurance.
Analyze Results:
Analyze the test results to identify performance bottlenecks, errors, and areas that need improvement. This may involve pinpointing slow database queries, inefficient code, or infrastructure limitations (see the EXPLAIN ANALYZE sketch after these steps).
Optimize and Retest:
Optimize the code as needed and re-run the tests to confirm improvements.
Report and Documentation:
Prepare a report that summarizes the test results, identified issues, and any actions taken for optimization.
Automation:
Consider automating performance tests to run them regularly as part of your continuous integration and delivery (CI/CD) pipeline.
Iterate:
Performance testing should be an iterative process, with regular testing cycles as the application evolves and user loads change.
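A practical companion to the Analyze Results and Optimize and Retest steps in PostgreSQL is EXPLAIN (ANALYZE, BUFFERS), which executes a query once and reports the actual plan, row counts, timings, and buffer usage. Below is a minimal sketch, assuming the users table used later in this article; the index name is purely illustrative.

-- Run the query once and show the real execution plan with timings and buffer usage.
-- A sequential scan on a large table here usually hints at a missing index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE name = 'TEST';

-- If the plan shows a sequential scan on the filtered column, an index is a
-- common first optimization before re-running the benchmark.
CREATE INDEX IF NOT EXISTS users_name_idx ON users (name);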
Custom solution without any 3rd party service
First, let's examine the folder structure. This might seem unimportant at first glance, but it's crucial to follow this structure because everything relies on it. You'll gain a better understanding of its importance as you read further.
The predefined function benchmark.sql, located in the /performance-tests/functions/ directory, contains both the definition of the benchmarking function and its invocation.
In addition to the function, our primary concern is to create a folder named after the query we intend to test inside the /performance-tests/ directory, together with three files in it: result.md, seed.sql and test.sql.
IMPORTANT: The query folder name should follow the camelCase naming convention.
The contents of all files can be found further below.
/api
├── ...
├── /performance-tests
│   ├── /functions
│   │   └── benchmark.sql   # Benchmark function for performance testing
│   └── /queryName          # Folder named after our query (camelCase naming convention)
│       ├── result.md
│       ├── seed.sql
│       └── test.sql
└── ...
Content of benchmark.sql:
-- Define a benchmarking function for SQL queries
CREATE OR REPLACE FUNCTION bench(query TEXT, iterations INTEGER = 1000)
RETURNS TABLE (avg FLOAT, min FLOAT, q1 FLOAT, median FLOAT, q3 FLOAT, p95 FLOAT, max FLOAT) AS $$
DECLARE
  _start TIMESTAMPTZ;       -- Timestamp to record query start time
  _end TIMESTAMPTZ;         -- Timestamp to record query end time
  _delta DOUBLE PRECISION;  -- Elapsed time in milliseconds
BEGIN
  -- Create a temporary table to store query execution times
  CREATE TEMP TABLE IF NOT EXISTS _bench_results (
    elapsed DOUBLE PRECISION
  );

  -- Warm the cache by executing the query 5 times
  FOR i IN 1..5 LOOP
    EXECUTE query;
  END LOOP;

  -- Run the query for the specified number of iterations and record elapsed times
  FOR i IN 1..iterations LOOP
    _start = clock_timestamp();  -- Record start time
    EXECUTE query;               -- Execute the query
    _end = clock_timestamp();    -- Record end time
    _delta = 1000 * (extract(epoch FROM _end) - extract(epoch FROM _start));  -- Calculate elapsed time in milliseconds
    INSERT INTO _bench_results VALUES (_delta);  -- Store elapsed time in the temporary table
  END LOOP;

  -- Return performance metrics based on the recorded data
  RETURN QUERY SELECT
    ROUND(AVG(elapsed)::NUMERIC, 2)::FLOAT,                                             -- Average elapsed time
    ROUND(MIN(elapsed)::NUMERIC, 2)::FLOAT,                                             -- Minimum elapsed time
    ROUND((PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY elapsed))::NUMERIC, 2)::FLOAT,  -- 25th percentile (Q1)
    ROUND((PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY elapsed))::NUMERIC, 2)::FLOAT,   -- Median (50th percentile)
    ROUND((PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY elapsed))::NUMERIC, 2)::FLOAT,  -- 75th percentile (Q3)
    ROUND((PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY elapsed))::NUMERIC, 2)::FLOAT,  -- 95th percentile (p95)
    ROUND(MAX(elapsed)::NUMERIC, 2)::FLOAT                                              -- Maximum elapsed time
  FROM _bench_results;

  -- Clean up: drop the temporary table
  DROP TABLE IF EXISTS _bench_results;
END
$$ LANGUAGE plpgsql;  -- Define the language for this function as PL/pgSQL
SELECT * FROM bench(:'query');  -- Utilization of the bench function with the 'query' variable (read further for more information)
The output of the bench function provides valuable insights into the performance of your SQL query. Here's what you can inspect and why it's important:
Average Elapsed Time: The average time it takes to execute the query over the specified number of iterations. This gives you a sense of the query's typical performance.
Minimum Elapsed Time: The quickest execution time among all iterations. This helps identify best-case performance.
Quartiles (Q1, Median, Q3): Quartiles divide the data into four equal parts. Q1 represents the 25th percentile, Median the 50th percentile, and Q3 the 75th percentile of elapsed times. These metrics provide insights into the distribution of query execution times.
95th Percentile: The elapsed time below which 95% of query executions fall. It helps identify outliers and potential performance bottlenecks.
Maximum Elapsed Time: The longest execution time among all iterations. This helps identify worst-case performance.
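If you want to experiment interactively instead of going through the shell script described below, the bench function can also be called directly in psql with a query string and an optional iteration count. A small sketch, using the example query and users table introduced later in this article:

-- Benchmark the example query for 500 iterations instead of the default 1000.
-- Note the doubled single quotes inside the query string literal.
SELECT * FROM bench('SELECT * FROM users WHERE name = ''TEST''', 500);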
Into the seed.sql file we created earlier, we need to write our raw SQL seeds for performance testing. For optimal results, we should generate between 10x and 1,000x the number of rows we currently store in our database.
INSERT INTO users (id, name)
SELECT
  ROW_NUMBER() OVER (ORDER BY GENERATE_SERIES),  -- id
  'TEST'                                         -- name
FROM GENERATE_SERIES(1, 10000);                  -- number of rows to insert
If you want to create n rows, you can use the GENERATE_SERIES(1, n) function, as in the example above. (More information about this and other set-returning functions can be found in the PostgreSQL documentation.)
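If identical values would make the test unrealistically easy for the query planner, the seed can also generate varied data. A sketch, assuming the same users(id, name) table, where md5(random()) is just one convenient way to produce distinct names:

-- Insert 100,000 rows with distinct names so selectivity is closer to production data.
INSERT INTO users (id, name)
SELECT
  gs,                              -- id
  'user_' || md5(random()::TEXT)   -- name, e.g. 'user_9f86d081...'
FROM GENERATE_SERIES(1, 100000) AS gs;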
Last but not least, we need to paste or (re)write the query that we want to test into the test.sql file.
SELECT * FROM users WHERE name = 'TEST';
In reality, you will have larger and more complex queries as well as larger and more complex seeds. The examples above are just for illustration purposes.
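For a taste of what a more involved test.sql might look like, here is a sketch; the orders table and its columns are hypothetical and would need corresponding seeds in seed.sql:

-- Hypothetical example only: assumes an orders(id, user_id) table seeded in seed.sql.
-- Joins and aggregates like this tend to surface missing indexes more readily
-- than a single-table lookup.
SELECT u.id, u.name, COUNT(o.id) AS order_count
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE u.name = 'TEST'
GROUP BY u.id, u.name;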
After all the steps described above, the last thing we need to do is run the command:
$ npm run test:performance -- <queryName>
To be able to run this shell script, the scripts section in the package.json file needs to be extended as follows:
{ "scripts": { "test:performance": "sh scripts/../index.sh" }}
The content of the shell script can be found below.
The shell script takes care of a few things:
Makes sure that the folder for the specific performance test exists
Makes sure that seed.sql and test.sql are defined in the performance test folder
Executes the seed(s) and the benchmark of the tested query in a transaction against the database running in a Docker container
#!/bin/bash

# Check if queryName argument is provided
if [ "$#" -ne 1 ]; then
  echo "Usage: npm run test:performance -- <queryName>"
  exit 1
fi

# Assign the argument to queryName
queryName="$1"

# Check if the specified query directory exists
if [ ! -d "./performance-tests/${queryName}" ]; then
  echo "Error: Directory './performance-tests/${queryName}' does not exist."
  exit 1
fi

# Check if the specified query seeds exist in the directory
if [ ! -f "./performance-tests/${queryName}/seed.sql" ]; then
  echo "Error: File 'seed.sql' in directory './performance-tests/${queryName}' does not exist."
  exit 1
fi

# Check if the specified query exists in the directory
if [ ! -f "./performance-tests/${queryName}/test.sql" ]; then
  echo "Error: File 'test.sql' in directory './performance-tests/${queryName}' does not exist."
  exit 1
fi

# Read the content of seed.sql and benchmark.sql into variables
seedContent="$(< "./performance-tests/${queryName}/seed.sql")"
benchmarkContent="$(< "./performance-tests/functions/benchmark.sql")"

# Execute the merged SQL wrapped in a transaction using psql in Docker
echo "BEGIN; $seedContent $benchmarkContent ROLLBACK;" | docker exec -i "YOUR_CONTAINER_NAME" psql -U postgres -d "YOUR_DATABASE_NAME" -v query="$(< "./performance-tests/${queryName}/test.sql")"
After running the shell script, a similar output should be visible in the terminal:
BEGIN
INSERT 0 10000
CREATE FUNCTION
 avg | min  |  q1  | median |  q3  | p95  | max
-----+------+------+--------+------+------+-----
 0.6 | 0.54 | 0.56 |   0.62 | 0.62 | 0.68 | 0.7
(1 row)
ROLLBACK
If everything looks healthy and doesn't need any refactoring, the results are stored in the result.md file.
## Performance Results for 10k Users
The following table displays the performance results for a test with 10,000 users:
| Metric    | Average | Minimum | Q1     | Median | Q3     | P95    | Maximum |
| :-------- | :------ | :------ | :----- | :----- | :----- | :----- | :------ |
| Time (ms) | 0.6     | 0.54    | 0.56   | 0.62   | 0.62   | 0.68   | 0.7     |

These results indicate the elapsed time (in milliseconds) for executing the query
on 10,000 users. The average time is 0.6 ms, with the minimum time being
0.54 ms and the maximum time reaching 0.7 ms. The median time
(50th percentile) is 0.62 ms, and the first quartile (25th percentile) and
third quartile (75th percentile) values are 0.56 ms and 0.62 ms, respectively.
The 95th percentile time (P95) is 0.68 ms.
PostgreSQL function languages
PostgreSQL supports various procedural languages for writing functions. The LANGUAGE clause specifies the language used for the given function. In the case above, we are using PL/pgSQL.
You can explore more about PostgreSQL function languages in the official documentation.
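For comparison, a function that needs no loops or variables can be written directly in the SQL language instead of PL/pgSQL. A minimal sketch, assuming the users table from the earlier examples; the function name is purely illustrative:

-- A simple set-based helper wrapped in a plain SQL-language function.
CREATE OR REPLACE FUNCTION count_test_users()
RETURNS BIGINT AS $$
  SELECT COUNT(*) FROM users WHERE name = 'TEST';
$$ LANGUAGE sql;

SELECT count_test_users();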
Now you have the tools to fine-tune your SQL queries
In conclusion, SQL query performance testing is the key to getting the most out of your database. By proactively identifying and addressing bottlenecks, inefficient queries, and resource allocation issues, you can help ensure applications respond swiftly and reliably. With the custom solution we've outlined, you have the tools to fine-tune your SQL queries and optimize your database's performance.