Index: /reasoner/evaluation.tex
===================================================================
--- /reasoner/evaluation.tex	(revision 229)
+++ /reasoner/evaluation.tex	(revision 230)
@@ -1,23 +1,23 @@
 \section{Evaluation}\label{sectEvaluation}
 
-The goal of this evaluation is to discuss the performance of the actual reasoner implementation, in particular in comparison to the base version that was the starting point for the revision. Therefore, we aim here at a practical and illustrative comparison in the sense of a relaxed technical experiment rather than a fully-fledged technical experiment in an artificial environment. However, the technical measurement support that we describe here would allow for an experiment in a controlled technical environment. 
+The goal of this evaluation is to quantify the performance of the IVML reasoner implementation, in particular in comparison to the initial version that we used as the starting point for the revision. Therefore, we aim here at a practical and illustrative comparison in the sense of a relaxed technical experiment rather than a fully-fledged technical experiment. However, the technical measurement support that we employ would allow for an experiment in a controlled technical environment. 
 
-The practical perspective on the experiment that we take here allows us to measure the reasoner in some form of application setup, i.e., we consider variant setups including different Java versions (Java 8 and Java 9), different operating systems (Windows, Linux) as well as Eclipse. This allows us to discuss the impact of different Java versions (Java 8 and Java 9), different operating systems (Windows 7, Windows 10, Linux) as well as a measurement setup with and without Eclipse, the latter is similar to using EASy-Producer as a headless library. In contrast, for a strict technical experiment, we would have to ensure that, e.g., only absolutely required services are running or user interface that may influence the measurements is not present. Typically, such strict requirements exclude using Windows as operating system. For the measurements, we focus here on response time and leave other potential interesting performance dimensions like memory usage to future evaluations. 
+The practical perspective on the experiment that we take here allows us to measure the reasoner in some form of application setup, i.e., we consider variant setups including different Java versions (Java 8 and Java 9), different operating systems (Windows 7, Windows 10, Linux) as well as Eclipse as hosting platform. This allows us to discuss the impact of the Java version, the operating system, and the platform, i.e., running in Eclipse or standalone. In contrast, for a strict technical experiment, we would have to ensure that, e.g., only absolutely required services are running and that no user interface (including the Eclipse user interface) that may influence the measurements is present. Typically, such strict requirements exclude using Windows as operating system. We focus here exclusively on response time and leave other potentially interesting performance dimensions like memory usage to future evaluations. 
 
-We discuss in Section \ref{sectEvaluationSetup} the setup of this evaluation, in Section \ref{sectModelComplexity} how to determine the complexity of IVML models and in Section \ref{sectEvaluationResults} the results.
+We present in Section \ref{sectEvaluationSetup} the setup of this evaluation. As the involved IVML models must be ordered for the presentation of the results, we discuss in Section \ref{sectModelComplexity} a pragmatic ranking based on model complexity. Finally, in Section \ref{sectEvaluationResults}, we present and discuss the results.
 
 \subsection{Setup}\label{sectEvaluationSetup}
 
-In this section, we discuss the setup of the experiment in terms of subjects, treatments, data collection, experimental procedure.
+In this section, we present the setup of the experiment in terms of subjects, treatments, data collection, and experimental procedure.
 
 \emph{Subjects:} The subjects in this evaluation are two versions of the EASy-Producer SSE reasoner, namely: 
 \begin{itemize}
   \item The original reasoner implementation that acted as basis for the revision. This base version\footnote{\label{reasonerBaseVersion}Git hash 6a00aa9c5aaa37ddb3d490d36c7e9a037e792656} is part of EASy-Producer release 1.1.0, i.e., we will call this original implementation \emph{reasoner v1.1.0}.
-  \item The revised reasoner implementation using the algorithms described in this document. This version\footnote{\label{reasonerActualVersion}Git hash e6cf7dcb850857cecfa4088434f0a717d16234e8. This is rather similar to the implementation in EASy-Producer release 1.2.0, but includes some detail improvements.} will become part of EASy-Producer release 1.3.0, i.e., we will call this \emph{reasoner v1.3.0}.
+  \item The revised reasoner implementation using the algorithms described in this document. This version\footnote{\label{reasonerActualVersion}Git hash e6cf7dcb850857cecfa4088434f0a717d16234e8. This version is rather similar to the one in EASy-Producer release 1.2.0, but includes some further improvements.} is part of EASy-Producer release 1.3.0, i.e., we will call it \emph{reasoner v1.3.0}.
 \end{itemize}
 
-\emph{Treatments:} Several test cases of EASy-Producer involve reasoning, in particular the test suites for the SSE reasoner (based on reasoner core test suite), the runtime extension for VIL, the larger scenario test cases (including the models from FP7 QualiMaster \cite{EichelbergerQinSizonenko+16}) as well as the scenario test cases for the BMWi ScaleLog\footnote{These test cases are not publicly available as they contain propretary knowledge of the industrial partner in the ScaleLog project.} project.   We use these test cases as experimental treatments, although this involves test dependencies such as jUnit. While some of these test cases rely on programmed models, most of the test cases specify the underlying model in terms of IVML, i.e., require for execution the IVML parser as well as dependent Eclipse libraries. Moreover, it is important to note that EASy-Producer including reasoner v1.1.0 does not include several test cases that have been created for v1.3.0. For this experiment, we enable for v1.1.0 as many test cases as possible, i.e., we patch back\footnote{\label{fnPatch}Patch is available from \MISSING{XXX}.} the v1.3.0 test cases into v1.1.0 and, if required, either adjust the expected test result accordingly or in the extreme cases disable test cases that cannot be handled by the v1.1.0 reasoner (or the related IVML implementation). For this reason, the treatment sets differ in terms of specific tests, while the most imporant small and large models are the same in both subject sets. We believe that this is acceptable for an illustrative experiment. 
+\emph{Treatments:} Several test cases of EASy-Producer involve reasoning, in particular the test suites for the SSE reasoner (based on the \IVML{ReasonerCore} test suite), the VIL runtime extension, the scenario test cases (including the models from FP7 QualiMaster \cite{EichelbergerQinSizonenko+16}) as well as the scenario test cases for the BMWi ScaleLog\footnote{These test cases are not publicly available as they contain proprietary knowledge of the industrial partner in the ScaleLog project.} project. We use these test cases as experimental treatments, although this involves test dependencies such as jUnit. While some of the test cases rely on programmed models (in terms of the IVML object model), most of the test cases specify the model in terms of IVML, which yields a more realistic setup but requires for execution the IVML parser as well as dependent Eclipse libraries. As we focus on the reasoning time, the actual creation of the reasoning model shall not affect the results. Moreover, it is important to note that the code for reasoner v1.1.0 does not include several test cases that have been created for v1.3.0. For this experiment, we enable for v1.1.0 as many test cases as possible, i.e., we patch back\footnote{\label{fnPatch}Patch is available from \MISSING{XXX}.} the v1.3.0 test cases into v1.1.0 and, if required, either adjust the expected test result accordingly or, in extreme cases, disable test cases that cannot be handled by the v1.1.0 reasoner (or the underlying IVML implementation). For this reason, the treatment sets differ in terms of specific tests, while most of the important small and large models are the same. We believe that this is acceptable for an illustrative experiment. 
 
-\emph{Data collection:} In the test cases mentioned above, we employ a generic measurement data collector, which can be feeded with key-value pairs representing measured (real) values. Collected values are stored when the data collection for a test case is finished. By default, the collector can automatically account for (wall) response time, which can help validating more detailed measures done during the execution of a treatment. For the measurements in this evaluation, we include the default statistics collected by the SSE reasoner such as translation time, evaluation time, number of failed constraints, number of re-evaluated constraints as well as model statistics and complexity measures (cf. Section \ref{sectModelComplexity}) delivered by EASy-Producer. However, EASy-Producer release v1.1.0 did not contain the generic data collector, so, along with the test cases, we had to back-patch\footref{fnPatch} the implementation of the measurement collector from v1.3.0 into v1.1.0. 
+\emph{Data collection:} In the test cases mentioned above, we employ a generic measurement data collector, which can be fed with key-value pairs representing measured (real) values. Collected values are stored when the data collection for a test case is finished. By default, the collector can automatically account for (wall) response time, which can help validate more detailed time measures collected during the execution of a treatment. For the measurements in this evaluation, we include the default statistics collected by the SSE reasoner, i.e., translation time, evaluation time, number of failed constraints, number of re-evaluated constraints, model statistics, and complexity measures (cf. Section \ref{sectModelComplexity}) delivered by EASy-Producer. However, EASy-Producer release v1.1.0 did not contain the generic data collector and several measures, so, along with the test cases, we patched the related code from v1.3.0 back\footref{fnPatch} into v1.1.0. 
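The key-value measurement collection described above can be sketched as follows. This is a minimal illustration only; the class and method names (\texttt{MeasurementCollector}, \texttt{start}, \texttt{measure}, \texttt{finish}) are hypothetical and do not reflect the actual EASy-Producer API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch of a generic measurement data collector: test cases feed it
 * with key-value pairs of measured (real) values, and the collector accounts
 * for the wall response time between start() and finish() automatically.
 * All names are hypothetical, not the actual EASy-Producer implementation.
 */
public class MeasurementCollector {

    private final Map<String, Double> values = new LinkedHashMap<>();
    private long startNanos;

    /** Begins collection for one test case, recording the wall-clock start. */
    public void start() {
        values.clear();
        startNanos = System.nanoTime();
    }

    /** Records a measured (real) value under the given key. */
    public void measure(String key, double value) {
        values.put(key, value);
    }

    /** Finishes collection, adds the wall response time, returns a snapshot. */
    public Map<String, Double> finish() {
        values.put("wallResponseTimeMs", (System.nanoTime() - startNanos) / 1e6);
        return new LinkedHashMap<>(values);
    }

    public static void main(String[] args) {
        MeasurementCollector collector = new MeasurementCollector();
        collector.start();
        // hypothetical measures a reasoner test case might record
        collector.measure("translationTimeMs", 12.5);
        collector.measure("evaluationTimeMs", 40.0);
        Map<String, Double> result = collector.finish();
        System.out.println(result.keySet());
    }
}
```

The automatically recorded wall response time bounds the sum of the detailed measures from above, which is the validation use mentioned in the text.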
 
 \MISSING{Here}
