Index: /reasoner/evaluation.tex
===================================================================
--- /reasoner/evaluation.tex	(revision 175)
+++ /reasoner/evaluation.tex	(revision 176)
@@ -1,5 +1,7 @@
 \section{Evaluation}\label{sectEvaluation}
 
-The goal of this evaluation is to measure and discuss the performance of the actual implementation reasoner, in particular in comparison to the base version that acted as starting point for the revision. We also compare briefly potential impacts of different Java versions (Java 8 and Java 9) as well as different operating systems (Windows 7, Windows 10, Linux) as during the development of the new reasoner a hardware exchange forced switching the underlying versions of Windows, Java and even Eclipse. We focus here on response time and leave other potential interesting performance dimensions like memory usage for future evaluations. 
+The goal of this evaluation is to discuss the performance of the actual reasoner implementation, in particular in comparison to the base version that was the starting point for the revision. Therefore, we aim here at a practical and illustrative comparison in the sense of a relaxed technical experiment rather than a fully-fledged technical experiment in an artificial environment. However, the technical measurement support that we describe here would allow for an experiment in a controlled technical environment. 
+
+The practical perspective that we take here allows us to measure the reasoner in a form of application setup. We consider variant setups covering different Java versions (Java 8 and Java 9), different operating systems (Windows 7, Windows 10, Linux) as well as measurements with and without Eclipse, the latter being similar to using EASy-Producer as a headless library. This allows us to discuss the impact of these variations. In contrast, for a strict technical experiment, we would have to ensure that, e.g., only absolutely required services are running and that no user interface that may influence the measurements is present. Typically, such strict requirements exclude using Windows as operating system. For the measurements, we focus here on response time and leave other potentially interesting performance dimensions like memory usage to future evaluations.
 
 We discuss in Section \ref{sectEvaluationSetup} the setup of this evaluation, in Section \ref{sectModelComplexity} how to determine the complexity of IVML models and in Section \ref{sectEvaluationResults} the results.
@@ -7,7 +9,17 @@
 \subsection{Setup}\label{sectEvaluationSetup}
 
-\emph{Data collection:} The testing infrastructure of EASy-Producer implements a generic data collector, which can be feeded with key-value pairs representing measured (real) values. By default, the collector can account for coarse grained response time, which can help validating more detailed measures done during the actual measurement. For the measurements in this evaluation, we include the default statistics collected by the SSE reasoner such as translation time, evaluation time, number of failed constraints, number of re-evaluated constraints as well as model statistics and complexity measures (cf. Section \ref{sectModelComplexity}) delivered by EASy-Producer.
+In this section, we discuss the setup of the experiment in terms of subjects, treatments, data collection, and experimental procedure.
 
-\emph{Subjects:} The test cases of EASy-Producer involving reasoning, i.e., the test suites for the SSE reasoner (reused from the reasoner core), the runtime extension for VIL, the scenario test cases (including the models of QualiMaster) as well as the scenario test cases for the BMWi ScaleLog project. It is important to note that the base version of the reasoner\footnote{\label{reasonerBaseVersion}\TBD{Version 1.2.0-SNAPSHOT, git hash}} did not contain the generic data collector, so we had to backpack the original base version. Moreover, several test cases that have been created for testing the advanced features of the recent version of the SSE reasoner are also not included (and cannot be executed on the base version). However, while the subject sets differ in detail, the most imporant small and large models are the same in both subject sets.
+\emph{Subjects:} The subjects in this evaluation are two versions of the EASy-Producer SSE reasoner, namely: 
+\begin{itemize}
+  \item The original reasoner implementation that served as the basis for the revision. This base version\footnote{\label{reasonerBaseVersion}Git hash 6a00aa9c5aaa37ddb3d490d36c7e9a037e792656} is part of EASy-Producer release 1.1.0, i.e., we will call this original implementation \emph{reasoner v1.1.0}.
+  \item The revised reasoner implementation using the algorithms described in this document. This version\footnote{\label{reasonerActualVersion}Git hash e6cf7dcb850857cecfa4088434f0a717d16234e8. This is rather similar to the implementation in EASy-Producer release 1.2.0, but includes some detail improvements.} will become part of EASy-Producer release 1.3.0, i.e., we will call this \emph{reasoner v1.3.0}.
+\end{itemize}
+
+\emph{Treatments:} Several test cases of EASy-Producer involve reasoning, in particular the test suites for the SSE reasoner (based on the reasoner core test suite), the runtime extension for VIL, the larger scenario test cases (including the models from FP7 QualiMaster \cite{EichelbergerQinSizonenko+16}) as well as the scenario test cases for the BMWi ScaleLog\footnote{These test cases are not publicly available as they contain proprietary knowledge of the industrial partner in the ScaleLog project.} project. We use these test cases as experimental treatments, although this involves test dependencies such as JUnit. While some of these test cases rely on programmed models, most of the test cases specify the underlying model in terms of IVML, i.e., their execution requires the IVML parser as well as dependent Eclipse libraries. Moreover, it is important to note that EASy-Producer including reasoner v1.1.0 does not include several test cases that have been created for v1.3.0. For this experiment, we enable as many test cases as possible for v1.1.0, i.e., we patch back\footnote{\label{fnPatch}Patch is available from \MISSING{XXX}.} the v1.3.0 test cases into v1.1.0 and, if required, either adjust the expected test result accordingly or, in extreme cases, disable test cases that cannot be handled by the v1.1.0 reasoner (or the related IVML implementation). For this reason, the treatment sets differ in terms of specific tests, while the most important small and large models are the same in both treatment sets. We believe that this is acceptable for an illustrative experiment.
+
+\emph{Data collection:} In the test cases mentioned above, we employ a generic measurement data collector, which can be fed with key-value pairs representing measured (real) values. Collected values are stored when the data collection for a test case is finished. By default, the collector automatically accounts for (wall) response time, which can help to validate more detailed measurements taken during the execution of a treatment. For the measurements in this evaluation, we include the default statistics collected by the SSE reasoner such as translation time, evaluation time, number of failed constraints, number of re-evaluated constraints as well as model statistics and complexity measures (cf. Section \ref{sectModelComplexity}) delivered by EASy-Producer. However, EASy-Producer release v1.1.0 did not contain the generic data collector, so, along with the test cases, we had to back-patch\footref{fnPatch} the implementation of the measurement collector from v1.3.0 into v1.1.0.
+
+\MISSING{Here}
 
 \emph{Procedure:} We run the four test suites mentioned above each 5 times to collect response time and model. Typically, each test suite runs individually in a JVM. To compensate delayed JIT optimization, we include a ramp-up run that warms up the JVM. For most test cases, a simple in-memory model with a compound type, a collection over that type and a quantor constraint over the container variable is sufficient. However, for the QualiMaster models, we added as ramp-up run a full run of one of the largest models without accounting for reasoning time or without performing instantiation. \TBD{We execute all tests in a script outside Eclipse to avoid disturbances caused by functionality of the IDE. We execute this procedure on the most recent version of EASy-Producer\footnote{Version 1.3.0-SNAPSHOT, TBD{git-hash}} on an actual development machine, a Dell laptop \TBD{XXX} with Windows 10 and JDK9. We select Windows for a better comparison with the base version and also to measure the reasoner in a typical environment. For comparison, we run the same version of EASy-Producer on a Dell laptop \TBD{XXX} with Windows 7 and JDK8. For both windows machines, we disable first the virus scanner and terminate all programs that are not required for the execution of the tests. On the Windows 7 machine we also run the base version of the reasoner\footref{reasonerBaseVersion}. Finally, for curiosity, we run the the test execution script also on our continuous integration server, a Linux... VM at a point in time when no Jenkins tasks are running.}
@@ -107,5 +119,5 @@
 For short, we enable considering nested variables and the contents of expressions. Compounds and containers are considered with a double weight compared with all other types. Most of the constraint tree nodes are weighted by 1 including access to the actual instance of a compound (\IVMLself{}), except for container iterator operations that we weight by a higher and access to the value of decision variables and constants that we weight by a lower value. For a given IVML model, we calculate $cpx_v(cfg) + cpx_c(cfg)$ as overall complexity measure, because ratio-based measures as in the more homogeneous feature case mentioned above do not seem to correctly reflect the complexity of IVML models.
 
-\subsection{Results}\label{sectResults}
+\subsection{Results}\label{sectEvaluationResults}
 
 \TBD{Potential topics: