Saturday, April 28, 2012

SQL structures in Business Intelligence (BI) Testing - Top "10" learnings

Business Intelligence Systems are designed to provide strategic information for analysis. Some of the features of Business Intelligence Systems are:
1) Database designed for analytical tasks
2) Data from Multiple Source Systems
3) Read-Intensive Data
4) Availability of Current and Historical data
5) Ability for users to initiate reports

A few of the common terms used in BI space are:
a)Source Systems
b)Staging Area
c)Data Extraction, Transformation and Loading (ETL)
d)Enterprise Data-warehouse (EDW)
e)Reporting (Drill through reports)
f)UI components such as – Cubes, Dimensions & Measures
g)FACT tables, Dimension Tables
....etc

With the basics out of the way, lets jump on to core of the post – Top 10 potshots for solid SQL Structures:

1.Alias Names should be consistent across different SQL’s and should be sensible. Have a standard for alias names. For example –All fact tables should have alias names starting with ‘F’ and all dimension tables should have alias names starting with ‘D’. This enhances debugging

a.Fact_Company FCOM
b.Dim_Company DCOM

2.Use ‘ISNULL’ clause whenever there is data comparison between 2 columns. The SQL server does not compare NULL values. Hence if the NULL value is converted to some numeric/text value we can compare those records as well

3.Add appropriate comments wherever required

4.To make the SQL statements more readable, start each clause on a new line and indent when needed. Following is an example:
SELECT EMPID, FIRSTNAME
FROM DBO.EMPLOYEE
WHERE TITLE LIKE SALARY% AND COUNTRY=‘ABC’

5.Use DISTINCT clause only in SELECT statements if there is a possibility of duplicate rows. The DISTINCT clause creates a lot of extra work for SQL server and reduces the physical resources of other SQL statements.

6.Avoid ‘NOT IN’ condition as far as possible because it offers poor performance. Instead use one of the following
a.Use EXISTS or NOT EXISTS
b.Perform a LEFT OUTER JOIN and check for NULL condition

SELECT STG_EMP.EMP_ID
FROM DBO.STG_EMPLOYEE STG_EMP
WHERE EMP_ID NOT IN (SELECT EMPLOYEE_ID FROM DBO.DIM_EMPLOYEE)

Runs Slower Than

SELECT STG_EMP.EMP_ID,DIM_EMP.EMPLOYEE_ID
FROM DBO.STG_EMPLOYEE STG_EMP LEFT OUTER JOIN DBO.DIM_EMPLOYEE DIM_EMP
ON STG_EMP.EMP_ID=EDW.EDW_EMPLOYEE_ID
WHERE EMPLOYEE_ID IS NULL

7.Avoid using ORDER BY in the SELECT statements unless it is really needed because it adds a lot of extra overhead

8.UNION combines the result sets of 2 or more "select" queries. It removes duplicate rows between the various "select" statements whereas “UNION ALL” query returns all rows (even if the row exists in more than one of the "select" statements). Use UNION ALL instead of UNION when you are sure that the result sets of select queries are distinct. This prevents the UNION statement from trying to sort the data and remove duplicates, which hurts performance.

9.Avoid using SELECT *. Always write the required column names after the SELECT statement, like:

SELECT EMPLOYEEID, FIRSTNAME

This decreases the unnecessary disk I/O

10.When there is a choice of using the IN or the EXISTS clause in SQL, prefer using the EXISTS clause, as it is usually more efficient and performs faster. Consider the following example

SELECT STG_EMP.EMP_ID
FROM DBO.STG_EMPLOYEE STG_EMP
WHERE EMP_ID IN (SELECT EMPLOYEE_ID FROM DBO.DIM_EMPLOYEE)

is less efficient than

SELECT STG_EMP.EMP_ID
FROM DBO.STG_EMPLOYEE STG_EMP
WHERE EXISTS ( SELECT 1 FROM DBO.DIM_EMPLOYEE DIM_EMP
WHERE DIM_EMP. EMPLOYEE_ID = STG_EMP.EMP_ID)

Bonus Tip - When there is a choice of using the IN or the BETWEEN clauses in your Transact-SQL, use the BETWEEN clause, as it is much more efficient.

Disclaimer: The best practices listed are a result of my learning’s from encounters with seasoned DWBI geeks.

-Manav Ahuja