part about principle of query optimization

Commit 940c845c authored 5 months ago by Maurin Gilles
--- a/rapport/main.tex
+++ b/rapport/main.tex
@@ -2,11 +2,37 @@
 \usepackage[margin=1in,footskip=0.5in]{geometry}
 \usepackage[english]{babel}

-\usepackage{hyperref,enumitem,mathtools,ragged2e,lipsum}
+\usepackage{hyperref,enumitem,mathtools,ragged2e,lipsum,tabularx}
 \usepackage{graphicx,multicol,caption}
-\usepackage{algorithm, algpseudocode}
+\usepackage{algorithm, algpseudocode,listings}
 \usepackage{indentfirst}
 \usepackage[dvipsnames]{xcolor}
+\usepackage[normalem]{ulem}
+\usepackage{etoolbox}
+
+\AtBeginEnvironment{tabularx}{\footnotesize\centering}
+
+\definecolor{dkgreen}{rgb}{0,0.6,0}
+\definecolor{gray}{rgb}{0.5,0.5,0.5}
+\definecolor{mauve}{rgb}{0.58,0,0.82}
+\lstset{language=SQL,
+  basicstyle={\small\ttfamily},
+  belowskip=3mm,
+  breakatwhitespace=true,
+  breaklines=true,
+  classoffset=0,
+  columns=flexible,
+  commentstyle=\color{dkgreen},
+  framexleftmargin=0.25em,
+  frameshape={}{}{}{}, %To remove the vertical lines on the left, set `frameshape={}{}{}{}`
+  keywordstyle=\bfseries,
+  numbers=left, %If you want line numbers, set `numbers=left`
+  numberstyle=\tiny\color{gray},
+  showstringspaces=false,
+  stringstyle=\color{mauve},
+  tabsize=3,
+  xleftmargin =1em
+}

 \usepackage[backend=biber,style=alphabetic,sorting=ynt]{biblatex}
 \addbibresource{references.bib}
@@ -19,7 +45,6 @@
 \date{\today}

 \newtheorem{definition}{Definition}
-\newtheorem{example}{Example}[section]

 \newcommand{\join}{$\bowtie$}
 \newenvironment{Figure}{\par\medskip\noindent\minipage{\linewidth}}{\endminipage\par\medskip}
@@ -56,16 +81,16 @@ This tends to become a limit, in the common situations where one would like to l
 \textbf{Motivational example:} A travel agency TA plans travels thanks to its partners, an airline AL and a hotel chain HC.
 TA stores information about its customers and their preferences in an internal DBMS. 
 To plan a travel, this data has to be linked with information from AL and HC's own databases, and an external API that provides real-time exchange rates.
-In addition, TA, AL and HC might use database with different models (relational, XML, RDF...)
+In addition, TA, AL and HC might use DBMSs with different models (relational, XML, RDF...)\\

 \normalshape
 Data integration systems (DISs) aim to provide solutions for problems such as the one exposed in the previous example,
 by providing a uniform and semantic-oriented access to data from sources that evolve independently, on different formats, with different access time, and a lack of preliminary information about the data.\par

-These issues are specific to DISs, and handling them requires new optimisation techniques that aren't necessary in traditional DBMSs.
-The common approach in data integration to manage the heterogeneous sources by \textit{mapping} them to a certain unique model, in order to process queries over this model.
+These issues are specific to DISs, and handling them requires new optimization techniques that aren't necessary in traditional DBMSs.
+The common approach in data integration is to manage the heterogeneous sources by \textit{mapping} them to a certain unique model, in order to process queries over this model.
 The correspondance between the global schema that connects the mapped sources together, and the local schemas applied over the actual autonomous sources, is provided by a central module in a DIS called \textit{mediator}.
-There are multiple approaches to man efficiently the correspondances between the global and local schemas using views (GAV, LAV, GLAV) which are thoroughly explained in \cite{katsis}.
+There are multiple approaches to efficiently manage the correspondences between the global and local schemas using views (GAV, LAV, GLAV), which are thoroughly explained in \cite{katsis}.
 Several mapping strategies have been developed, referenced in the introduction of \cite{buron}.\\

 This paper assumes a mapping over a relational model and focuses on another crucial mission of the mediator which is query optimization.
@@ -79,15 +104,58 @@ and more particularly those that stipulate any preliminary knowledge of the data

 \subsection{Principle of query optimization}

-\lipsum[5]
+A \textit{query} is a question from a user about the data managed by the DBMS or DIS.
+\textit{Query processing} is the process of answering a query, broken down into concrete steps.
+The first step is to \textit{parse} the query into a sequence of operations applied to the sources.
+The result is a \textit{query plan}, which in our case is equivalent to a logical expression in relational algebra.\newline

-\begin{Figure}
+\itshape
+\textbf{Example:} Information about dogs, cats and rabbits is stored in three different sources.
+For each pet, the information contains its name and an id for its master.\\
+
+\uline{User query:} ``Which dogs, cats and rabbits have the same master?''\\
+
+\uline{Query expressed in a uniform language (SQL here):}
+\begin{lstlisting}[language=SQL]
+SELECT Dogs.name, Cats.name, Rabbits.name
+FROM Dogs
+    INNER JOIN Cats
+        ON Dogs.master=Cats.master
+    INNER JOIN Rabbits
+        ON Dogs.master=Rabbits.master
+\end{lstlisting}
+
+\uline{Query plan in relational algebra:}\\
+$
+\pi_{Dogs.name,Cats.name,Rabbits.name}(
+    (Dogs \bowtie_{master} Cats)
+    \bowtie_{master} Rabbits
+)
+$\\
+\normalshape
+
+The query plan defines the sequence of operators that will be applied to the data in order to produce an output answering the query.
+A plan is commonly represented as a tree of operators; Figure~\ref{fig:query_tree} shows an example of such a tree.
+\begin{Figure}
  \centering
-  \includegraphics[width=.8\linewidth]{./figs/wardos.png}
-  \captionof{figure}{Example}
+  \includegraphics[width=.55\linewidth]{./figs/query_tree_ex.png}
+  \captionof{figure}{Query plan example}\label{fig:query_tree}
 \end{Figure}

-\lipsum[6-7]
+However, the query plan for a given query is not unique, since a different sequence of operators can produce the same output.
+Two plans are called \textit{equivalent} when they produce the same output on every instance of the sources.
+In our previous example, equivalent plans include those where dogs are joined with rabbits first, and those where cats and rabbits are joined first.\par
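+
+As an illustration, and leaving the final projection aside for readability, these alternatives can be written in relational algebra. They are equivalent because the join is commutative and associative, and because all three sources are joined on the same attribute:
+\[
+(Dogs \bowtie_{master} Cats) \bowtie_{master} Rabbits
+\;\equiv\;
+(Dogs \bowtie_{master} Rabbits) \bowtie_{master} Cats
+\;\equiv\;
+Dogs \bowtie_{master} (Cats \bowtie_{master} Rabbits)
+\]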
+
+Every query can therefore be answered by a number of equivalent plans that grows very quickly as queries become more complex.
+Although these plans are algebraically equivalent, the difference between their computation times can be arbitrarily large.
+The goal of query optimization is to search for the best, or at least a good enough, query plan to answer the query.\par
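+
+To give an order of magnitude, consider only the choice of join order for a query over $n$ sources. Assuming binary joins only, allowing any pair of sub-plans to be joined, and counting the two operand orders of a join as distinct, there are $n!$ left-deep join orders and $\frac{(2(n-1))!}{(n-1)!}$ join trees of arbitrary shape:
+\[
+\begin{array}{lll}
+n = 3: & 3! = 6, & \frac{4!}{2!} = 12 \\[1ex]
+n = 10: & 10! = 3\,628\,800, & \frac{18!}{9!} = 17\,643\,225\,600
+\end{array}
+\]
+These counts ignore the choice of a physical algorithm for each operator, which enlarges the search space further.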
+
+Deciding what makes one plan better than another is a question in its own right, usually addressed by defining a \textit{cost metric},
+although multi-objective query optimization has also been proposed for DBMSs \cite{trummer}.\par
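+As a purely illustrative sketch (the metric and the cardinalities below are assumptions, not necessarily those used in the rest of this paper), a simple cost metric sums the sizes of the results of all the joins in a plan:
+\[
+C(P) =
+\begin{cases}
+0 & \text{if } P \text{ is a source,}\\
+|P_1 \bowtie P_2| + C(P_1) + C(P_2) & \text{if } P = P_1 \bowtie P_2.
+\end{cases}
+\]
+Suppose, hypothetically, that $|Dogs \bowtie Cats| = 5000$, that $|Dogs \bowtie Rabbits| = 20$, and that the final result contains $100$ tuples.
+Then $C((Dogs \bowtie Cats) \bowtie Rabbits) = 5000 + 100 = 5100$, whereas $C((Dogs \bowtie Rabbits) \bowtie Cats) = 20 + 100 = 120$: two equivalent plans, with very different costs under this metric.\par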
+In the relational model, some optimization strategies are proven to always improve the query plan.
+However, other decisions have no universally best answer and must be adapted to each query.
+They are traditionally made using information about the data in the sources, which makes it possible to estimate the cost of each plan.
+Making those estimations without such information is a major issue in DISs, to which this paper proposes a partial solution.

 \subsection{Dynamic query optimisation}

@@ -119,7 +187,7 @@ and more particularly those that stipulate any preliminary knowledge of the data

 \lipsum[13-15]

-\subsection{Join size estimation}
+\subsection{Join size estimation}\label{sec:jse}

 \lipsum[8-9]