COST-BASED QUERY OPTIMIZATION IN CENTRALIZED RELATIONAL DATABASES

Query optimization is the most significant factor for any centralized relational database management system (RDBMS) that reduces the total execution time of a query. Query optimization is the process of executing a SQL (Structured Query Language) query in relational databases to determine the most efficient way to execute a given query by considering the possible query plans. The goal of query optimization is to optimize the given query for the sake of efficiency. Cost-based query optimization compares different strategies based on relative costs (amount of time that the query needs to run) and selects and executes one that minimizes the cost. The cost of a strategy is just an estimate based on how many estimated CPU and I/O resources that the query will use. In this paper, cost is considered by counting number of disk accesses for each query plan because disk access tends to be the dominant cost in query processing for centralized relational databases.


INTRODUCTION
A centralized relational database is the database that stores and maintains data in a single location using tables (or relations).Because of widespread use of such databases, it is important to reduce system resources required to fulfill a query and ultimately provide correct information faster.Query optimization in database has gained significant importance as it helps to reduce the size, memory usage and time required for any query to be processed.The main objective of any query optimization is to determine the best strategy for executing each query.It identifies an efficient way to execute the query with less time complexity to produce better results.This process can be formally defined as transforming a query into an equivalent form which can be evaluated more efficiently (Tejy 2016).Query optimization is the process of determining the most efficient way to execute a given query by considering the possible query plans.Wong and Youssefi (1976) and Selinger et al. (1979) have done work related to the relational query optimization.Several approaches, methods and techniques of query optimization have been proposed for various DBMS (i. e., relational, deductive, distributed, object, parallel).The quality of query optimization methods depends strongly on the accuracy on the efficiency of cost models (Hussein et al. 2005, Naacke et al. 1998, Zhu et al. 2003, Adali et al. 1996, Ganguly et al. 1996, Gardarin et al. 1996).
Query optimization is an important part of query processing.The four main phases of query processing are decomposition (consisting of parsing and validation), optimization, code generation and execution.Query processing is the activities involved in parsing, validating, optimizing and executing a query.Query processing transforms a query written in a high-level language, typically SQL, into a correct and efficient execution strategy expressed in a low-level language (implementing the relational algebra), and executes the strategy to retrieve the required data.The activity of choosing an efficient executing strategy for processing query is query optimization (Thomas & Carolyn 2015).It is generally a process of reducing total execution time of the query, which is the sum of the execution times of all individual operations that make up the query (Selinger et al. 1979).
Query optimization needs database statistics to evaluate different execution strategies properly.The statistics cover information about relations, attributes and indexes.The accuracy and currency of these statistics have a significant bearing on the efficiency of the execution strategy chosen.Keeping the statistics current can be problematic.If the statistics is updated every time, the database is updated there is significant impact on performance during peak periods.An alternative and generally preferable approach is to update the statistics on a periodic basis, for example, nightly or whenever the system is idle.
The cost-based query optimization technique compares different strategies based on their relative costs and selects the one that minimizes resource usage.Because disk access is slow compared to memory access, disk access tends to be the dominant cost in query processing for a centralized relational DBMS.The main objective of this study is to compare cost of different execution strategies of a SQL query represented as relational algebra expressions in centralized relational databases to choose most efficient execution strategy.
As given in Fig. 1, the query processing has four phases as query decomposition, query optimization, code generation and runtime query execution.The important aspect of query processing is query decomposition.Query decomposition is the first phase of query processing.This phase transforms a high-level query into a relational algebra query and to check whether the query is syntactically and semantically correct.The query optimization phase chooses an efficient execution strategy that minimizes the overall cost of execution.This phase uses information from database statistics to find cost of each execution strategy.The code generation phase receives an optimal execution strategy from the optimization phase and produces an iterative execution plan that is usable by the rest of the database.Runtime query execution is the last phase of query processing and runs the query and displays the required result.

Cost-based query optimization
Cost-based query optimization is an overall process of choosing the most efficient means of executing a SQL statement based on overall cost of the query.The efficient execution is the execution with minimum cost.To find the cost of query execution plan, the optimization technique uses database statistics.
In this optimization technique, all of the possible ways or scenarios in which a query can be executed will be assigned a 'cost', which indicates how efficiently that query can be run.Then, the optimizer will pick the scenario that has the least cost and execute the query using that scenario, because the query with least cost is the most efficient way to run the query.The dominant cost in query processing for centralized relational databases is disk access because disk access is slower than memory access.So the optimization technique counts the number of disk accesses of each scenario and execute the scenario with minimum number of disk accesses.In centralized systems, the costs are dominated by the time for secondary storage access although the CPU costs may be quite high for complex queries (Gotlieb 1975).
During decomposition phase, high level query (SQL) is transformed into some internal representation typically using query tree (relational algebra tree).Thomas and Carolyn (2015) have devised a rule for constructing query tree as follows:  A leaf node is created for each base relation in the query.
 A non leaf node is created for each intermediate relation produced by a relational algebra operation.
 The root of the tree represents the result of thee query.
 The sequence of operations is directed from the leaves to the root.

RESULTS AND DISCUSSION
Consider two relations employee (emp_no, emp_name, emp_address, position, salary, branch_no) and branch (branck_no, branch_city, branch_address, city) with a member of employee can only work at one branch.
Consider an SQL query as given below.The relational algebra trees for each of above queries are listed below as shown in Figs 2-4.

SELECT
Suppose there are 2000 tuples in Employee, 20 tuples in Branch, 20 Managers (one for each branch), and 10 Kathmandu branches.To compare these three queries, we assume number of disk accesses.We also assume that there are no indexes or sort keys on either relation.The results of any intermediate operations are stored on disk.
The cost of the final write is ignored because it is the same in each query.We further assume that tuples are accessed one at a time (although in practice disk accesses would be based on blocks, which would typically contain several tuples), and main memory is large enough to process entire relations for each relational algebra operation.= 2080 disk accesses.From the calculations above, it is clear that query 3 is the most efficient query and is 2.89 times faster than query 2 and 39.43 times faster than the query1.Figure 5 shows the cost comparisons of each relational algebra query.It can be seen clearly that that the cost of query 3 is minimum as compared to the cost of other two queries.

Fig. 5. Cost comparison of each query
If we considered 10000 tuples in Employee and 1000 tuples in Branch query 3 would be 2.38 times faster than query 2 and 1536.84 times faster than the query1.Since, Cartesian product and Join operations are much more expensive than Selection operation, query3 significantly reduces the size of the relations that are being joined together.

CONCLUSION
Query optimization is the process of determining the most efficient method for a SQL statement to access requested data.A SQL query can have different query execution strategies and the cost-based query optimization technique selects and executes the query execution strategy with least cost among all the execution strategies.To find the cost of each execution strategy, the optimization technique uses database statistics, because disk access is slower as compared with memory access and disk access tends to be the dominant cost in query processing for centralized relational databases.Using database statistics, the optimization technique counts the number of disk accesses for each execution strategy in centralized relational databases.