Index: /reasoner/conclusion.tex
===================================================================
--- /reasoner/conclusion.tex	(revision 273)
+++ /reasoner/conclusion.tex	(revision 274)
@@ -9,2 +9,4 @@
     \item quantification unrolling
 \end{itemize}
+
+Pi speedup needed!
Index: /reasoner/evaluation.tex
===================================================================
--- /reasoner/evaluation.tex	(revision 273)
+++ /reasoner/evaluation.tex	(revision 274)
@@ -15,7 +15,9 @@
 The subjects in this evaluation are two versions of the EASy-Producer SSE reasoner, namely: 
 \begin{itemize}
-  \item The original reasoner implementation that acted as basis for the revision. This base version\footnote{\label{reasonerBaseVersion}Git hash 6a00aa9c5aaa37ddb3d490d36c7e9a037e792656} is part of EASy-Producer release 1.1.0, i.e., we will call this original implementation \emph{reasoner v1.1.0}.
-  \item The revised reasoner implementation using the algorithms described in this document. This version\footnote{\label{reasonerActualVersion}Git hash e0b09aec719d5387255e2b979d90708f05364e75. This version rather similar to EASy-Producer release 1.2.0, but includes some further improvements.} is part of EASy-Producer release 1.3.0, i.e., we will call it \emph{reasoner v1.3.0}.
-\end{itemize}
+  \item The original reasoner implementation that acted as basis for the revision. This base version\footnote{\label{reasonerBaseVersion}Git hash 6a00aa9c5aaa37ddb3d490d36c7e9a037e792656} is part of EASy-Producer release 1.1.0, i.e., we will call this original implementation \emph{reasoner version 1.1.0}.
+  \item The revised reasoner implementation using the algorithms described in this document. This version\footnote{\label{reasonerActualVersion}Git hash e0b09aec719d5387255e2b979d90708f05364e75. This version is rather similar to EASy-Producer release 1.2.0, but includes some further improvements.} is part of EASy-Producer release 1.3.0, i.e., we will call it \emph{reasoner version 1.3.0}.
+\end{itemize}
+
+It is important to note that these two versions differ in functionality and degree of performance optimization. Version 1.3.0 as discussed in this report is significantly more IVML-complete than version 1.1.0, e.g., in terms of constraint translation capabilities. Moreover, version 1.3.0 implements additional functionality, in particular the instantiation of a reasoner for a given model/configuration to enable re-use of the constraint base. While modifying EASy-Producer version 1.1.0 for this experiment, we only touched the reasoner implementation to obtain comparable measurements (cf. Section \ref{sectEvalSetupDataCollection}). Thus, we did not port the re-use of the constraint base back into version 1.1.0.
 
 \subsubsection{Treatments}\label{sectEvalSetupTreatments}
@@ -31,5 +33,5 @@
 We use these test cases as experimental treatments, although they involve test dependencies such as jUnit. While some of the test cases rely on programmed models (in terms of the IVML object model), most of the test cases specify the model in terms of IVML, i.e., a more realistic setup, and thus require the IVML parser as well as dependent Eclipse libraries for execution. As we focus on the reasoning time, the actual creation of the reasoning model shall not affect the results.
 
-It is important to note that the code for reasoner v1.1.0 does not include several test cases that have been created for v1.3.0. For this experiment, we enable for v1.1.0 as many test cases as possible, i.e., we manually ported back\footnote{\label{fnPatch}Patch is available from \MISSING{XXX}.} the v1.3.0 test cases into v1.1.0 and, if required, either adjusted the expected test result accordingly or, in extreme cases, disable test cases that cannot be handled by the v1.1.0 reasoner (or the underlying IVML implementation). For this reason, the treatment sets differ in terms of specific tests, while most of the imporant small and large models are the same. We believe that this is acceptable for an illustrative experiment. 
+It is important to note that the code for reasoner version 1.1.0 does not include several test cases that have been created for version 1.3.0. For this experiment, we enabled as many test cases as possible for version 1.1.0, i.e., we manually ported back\footnote{\label{fnPatch}Patch is available from \MISSING{XXX}.} the version 1.3.0 test cases into version 1.1.0. If required, we either adjusted the expected test result accordingly or, in extreme cases, disabled test cases that cannot be handled by the version 1.1.0 reasoner (or the underlying IVML implementation). For this reason, the treatment sets differ in terms of specific tests, while most of the important small and large models are the same. We believe that this is acceptable for an illustrative experiment.
 
 \subsubsection{Data Collection}\label{sectEvalSetupDataCollection}
@@ -49,7 +51,7 @@
 During the experiment, we execute each of the test suites in Section \ref{sectEvalSetupTreatments}. Due to their use in continuous integration, each test suite runs in its own JVM instance. To compensate for delayed just-in-time (JIT) compilation, we include a specific ramp-up test in the experimental runs to warm up the JVM. For most test cases, reasoning over a simple representative model including a compound type, a collection over that type and a quantifier constraint over the container variable seems to be sufficient. For the QualiMaster models, we added a full run of one of the largest models as ramp-up, without accounting for reasoning time. If the test cases include artifact instantiation through VIL, we disable the instantiation phase.
 
-However, pilot experiments showed that still significant differences between the first runs of a test case and subsequent runs may occur. Thus, within each suite, we repeat the reasoning functionality of each test case 10 times on a fresh configuration. These 10 (here intuitively chosen) repetitions make up the intra-experiment repetitions. In particular, the repetitions allow for (later) excluding  warmup runs as well as for basic descriptive statistics such as confidence intervals \cite{GeorgesBuytaertEeckhout07}. 
-
-On a given machine/device, we first perform an initial run and then the experimental runs. During the initial run, we execute the test suites on the target device to validate that all tests are passed successfully. The measurements of this run are stored separately. For taking the experimental measures, we repeate the execution of the test suites 5 times (inter-experiment repetition) with 5 seconds pause between two subsequent runs.
+However, pilot experiments showed that significant differences between the first runs of a test case and subsequent runs may still occur. Thus, within each suite, we repeat the reasoning functionality of each test case 10 times on a fresh IVML configuration. These 10 (here intuitively chosen) repetitions make up the intra-experiment repetitions. In particular, the repetitions allow for (later) excluding warmup runs as well as for basic descriptive statistics such as confidence intervals \cite{GeorgesBuytaertEeckhout07}.
+
+On a given machine/device, we first perform an initial run and then the experimental runs. During the initial run, we execute the test suites on the target device to validate that all tests pass successfully. The measurements of this run are stored separately. For taking the experimental measures, we repeat the execution of the test suites 5 times (inter-experiment repetition) with a 5-second pause between two subsequent runs.
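The intra-experiment statistics described above can be sketched as follows. This is a minimal illustration, not part of the EASy-Producer code base; the function name, the warmup count and the hard-coded t-quantiles are assumptions. It computes the mean and 95% confidence interval of a test case's reasoning times after dropping warmup repetitions:

```python
import math
import statistics

def summarize_runs(times_ms, warmup=2):
    """Drop warmup repetitions, then report the mean and the 95%
    confidence interval of the remaining measurements, using the
    t-distribution as appropriate for small sample sizes."""
    runs = times_ms[warmup:]
    n = len(runs)
    # two-sided 95% t-quantiles for the degrees of freedom (n - 1)
    # occurring in this setup; a full table would come from scipy
    t_095 = {7: 2.365, 8: 2.306, 9: 2.262}[n - 1]
    mean = statistics.mean(runs)
    half_width = t_095 * statistics.stdev(runs) / math.sqrt(n)
    return mean, (mean - half_width, mean + half_width)
```

For the 10 intra-experiment repetitions with, e.g., the first two treated as warmup, n = 8 and thus 7 degrees of freedom remain for the interval.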
 
 Executing the test suites is realized as an ANT script, because collecting all dependencies for the standalone version of EASy-Producer is not trivial. For this task, we reuse the respective part of the ANT build mechanism from the continuous integration. ANT supports executing jUnit test suites based on a classpath constructed from the dependencies, so executing the test suites described in Section \ref{sectEvalSetupTreatments} is rather straightforward. The inclusion of the ramp-up tests, the intra-experiment and inter-experiment repetitions as well as the waiting time are configured in the ANT script and passed to the EASy-Producer test suites via environment parameters. For separating the initial and the experimental runs, the ANT script defines two specific tasks. In turn, the ANT script itself can be used as build action in the continuous integration, i.e., for collecting performance readings on the continuous integration server, e.g., for detecting performance degradation.
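The hand-over of the repetition settings from the ANT script to the test suites via environment parameters can be sketched roughly as follows; all parameter names and defaults here are hypothetical, as the actual names used by the EASy-Producer scripts are not shown in this document:

```python
import os

def repetition_config(env=os.environ):
    """Read the benchmark settings passed by the build script via
    environment parameters (all names are hypothetical examples)."""
    return {
        "rampup": env.get("EASY_BENCH_RAMPUP", "1") == "1",
        "intra_repetitions": int(env.get("EASY_BENCH_INTRA", "10")),
        "inter_repetitions": int(env.get("EASY_BENCH_INTER", "5")),
        "pause_seconds": int(env.get("EASY_BENCH_PAUSE", "5")),
    }
```

Keeping such settings in the environment rather than in the test code is what allows the same suites to run unchanged both in the experiments and in continuous integration.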
@@ -63,5 +65,30 @@
   \item\label{piId} A Raspberry Pi 3 by vendor element14 hosting an 8 GB SanDisk class-4 SD (a device used in \cite{KnocheEichelberger18}) with Raspbian Stretch Lite version November 2018, Linux Kernel 4.14 and Oracle JDK 1.8.\TBD{65} ARM. % and, as alternative for some experiments, Oracle JDK 1.8.0\_201 ARM.
 \end{enumerate}
+
 For both Windows machines, we terminate all programs that are not required for the execution of the tests (leaving the virus scanner in operation as usual during development).
+
+\begin{table*}[ht]
+\centering
+%\begin{adjustbox}{angle=90}
+\begin{tabular}{|c|c|c|c|}
+\hline
+\textbf{Experiment} & \multicolumn{2}{|c|}{\textbf{Setup}} & \textbf{Treatment } \\
+\textbf{id}                & \textbf{device}  & \textbf{JDK}          & \textbf{(reasoner version)} \\
+\hline
+1 & \ref{oldLaptopId}   & 1.8.0 & 1.1.0 \\
+2 & \ref{oldLaptopId}   & 1.8.0 & 1.3.0 \\
+3 & \ref{newLaptopId} & 1.8.0* & 1.3.0 \\
+4 & \ref{newLaptopId} & 10* & 1.3.0\\
+5 & \ref{jenkinsId}        & 1.8.0 & 1.3.0 \\
+6 & \ref{piId}                & 1.8.0 & 1.1.0 \\
+7 & \ref{piId}                & 1.8.0 & 1.3.0\\
+\hline
+\end{tabular}
+\caption{Experiment setup (* = OpenJDK).}
+\label{tab:experimentsCombinations}
+%\end{adjustbox}
+\end{table*}
+
+We combine the treatments and devices as shown in Table \ref{tab:experimentsCombinations}. On the original device \ref{oldLaptopId}, we execute both versions of the reasoner on the same JDK (experiment id 1 and 2). On the newer laptop \ref{newLaptopId}, we continue with the more recent reasoner version in a more modern installation (id 3 and 4). We utilize two different JDKs, Java 8 to enable a comparison with \ref{oldLaptopId}, and Java 10 to give an outlook on more recent JVMs. However, due to the more modern installation of \ref{newLaptopId}, we have to compare Oracle vs.~OpenJDK. On the continuous integration machine (\ref{jenkinsId}, id 5), we measure reasoner version 1.3.0, i.e., the most recent commit\footref{reasonerActualVersion} at that time. Finally, on the Pi device \ref{piId}, we measure both reasoner versions on the same JDK (id 6 and 7), here JDK 1.8, as no more recent JDK was available for the ARM architecture when conducting the experiments.
 
 \subsubsection{Analysis}\label{sectEvalSetupAnalysis}
@@ -213,17 +240,17 @@
 \begin{tabular}{|c|c|c|c||r|r|r|r|r|r|r||r|}
 \hline
-\textbf{id} & \textbf{de-}      & \textbf{ver-} &                       & \textbf{test}                    & \multicolumn{2}{|c|}{\textbf{reasoning}} & \multicolumn{2}{|c|}{\textbf{constraints}} & \multicolumn{2}{|c||}{\textbf{evaluations}} & \textbf{total} \\
-                  & \textbf{vice}     & \textbf{sion}  & \textbf{JDK} & \centerCell{\textbf{[\#]}} & \multicolumn{2}{|c|}{\textbf{time [ms]}} & \multicolumn{2}{|c|}{\textbf{[\#]}} &
+\textbf{id} & \textbf{de-}      &                       & \textbf{ver-}              & \textbf{test}                    & \multicolumn{2}{|c|}{\textbf{reasoning}} & \multicolumn{2}{|c|}{\textbf{constraints}} & \multicolumn{2}{|c||}{\textbf{evaluations}} & \textbf{total} \\
+                  & \textbf{vice}     & \textbf{JDK}  & \textbf{sion} & \centerCell{\textbf{[\#]}} & \multicolumn{2}{|c|}{\textbf{time [ms]}} & \multicolumn{2}{|c|}{\textbf{[\#]}} &
 \multicolumn{2}{|c||}{\textbf{[\#]}} & \textbf{time}\\
                   &                          &                       &                       &                                            & \textbf{avg} & \textbf{max}                     & \textbf{avg} & \textbf{max} & 
 \textbf{avg} & \textbf{max} & \textbf{[min]}\\
 \hline
-1 & \ref{oldLaptopId}   & 1.1.0 & 1.8.0   & 400 & 17 & 670       & 723   & 24110   & 1092 & 46469 & 29\\
-2 & \ref{oldLaptopId}   & 1.3.0 & 1.8.0   & 433 & 11 & 390       & 53     & 6000     & 603   & 23696 & 46\\
-3 & \ref{newLaptopId} & 1.3.0 & 1.8.0* & 433 & 14 & 502       & 53     & 6000     & 603   & 23696 & 55\\
-4 & \ref{newLaptopId} & 1.3.0 & 10*      & 433 & 15  & 493      & 53     & 6000     & 603   & 23696 & 55\\
-5 & \ref{jenkinsId}        & 1.3.0 & 1.8.0   & 427 & 13  & 480      & 53     & 6000     & 494  & 19466 & 104\\
-6 & \ref{piId}                & 1.1.0 & 1.8.0   & 400 & 548 & 28613 & 723   & 24110   & 1092 & 46477 & 366\\
-7 & \ref{piId}                & 1.3.0 & 1.8.0   & 433 & 181 & 6065   & 53     & 6000     & 603   & 23706 & 795\\
+1 & \ref{oldLaptopId}   & 1.8.0   & 1.1.0 & 400 & 17 & 670       & 723   & 24110   & 1092 & 46469 & 29\\
+2 & \ref{oldLaptopId}   & 1.8.0   & 1.3.0 & 433 & 11 & 390       & 53     & 6000     & 603   & 23696 & 46\\
+3 & \ref{newLaptopId} & 1.8.0* & 1.3.0 & 433 & 14 & 502       & 53     & 6000     & 603   & 23696 & 55\\
+4 & \ref{newLaptopId} & 10*      & 1.3.0 & 433 & 15  & 493      & 53     & 6000     & 603   & 23696 & 55\\
+5 & \ref{jenkinsId}        & 1.8.0   & 1.3.0 & 427 & 13  & 480      & 53     & 6000     & 494  & 19466 & 104\\
+6 & \ref{piId}                & 1.8.0   & 1.1.0 & 400 & 548 & 28613 & 723   & 24110   & 1092 & 46477 & 366\\
+7 & \ref{piId}                & 1.8.0   & 1.3.0 & 433 & 181 & 6065   & 53     & 6000     & 603   & 23706 & 795\\
 \hline
 \end{tabular}
@@ -234,71 +261,19 @@
 \end{table*}
 
-Table \ref{tab:experimentsDescSummary} summarizes for the initial run the number of tests, the measured reasoning time, the created number of constraints, and the performed constraint evaluations. As an indication of the overall time consumption of the experiment, we list also the total execution time for the 5 repeated executions of the testsuites. At a glance, the number of of tests, constraints and evaluations shall be the same for the all treatments and might appear to be irrelevant here. However, as the treatments differ (cf. Section \ref{sectEvalSetupTreatments}), we list also these numbers. In particular, in the pre-experiments these numbers helped us to identify minor bugs in experiments that accidentally caused the evaluation of same rather than different models in the extended QualiMaster cases. We do not show here the minimum reasoning time, number of constraints or evaluations, respectively, as the minimum number is usually 0 due to IVML test cases without constraints. 
-
-On the original device \ref{oldLaptopId}, we execute both versions of the reasoner on the same JDK (id 1 and 2). On the newer laptop \ref{newLaptopId} we continue with the more recent reasoner version in a more modern (id 3, 4). In this setup, we use two different JDKs, Java 8 to enable a comparison with \ref{oldLaptopId}, and Java 10 to give an outlook on more recent JVMs. On the continuous integration machine (\ref{jenkinsId}, id 5) we just measure reasoner version 1.3.0 corresponding to the most recent commit\footref{reasonerActualVersion} at that time. Finally, on the Pi device \ref{piId} we measure both reasoner versions on the same JDK (id 6 and 7).
-
-As mentioned in Section \ref{sectEvalSetupTreatments}, the treatments differ between the reasoner versions (id 1/2 as well as 6/7) as not all tests could be ported back successfully. Moreover, the number of tests on Jenkins (\ref{jenkinsId}, id 5) differs, as some tests involving VIL are disabled there due to technical issues. Except for the Pi \ref{piId}, the average reasoning time is rather similar across the devices and JDKs. However, according to the maximum reasoning time, reasoner version 1.3.0 operates faster than version 1.1.0. It is important mentioning that reasoner version 1.3.0 is both, more IVML complete and better tuned for performance as discussed in Section \ref{sectPerformance}. Moreover, reasoner version 1.1.0 even creates significantly more (including accidentally ineffective) constraints and, thus, performs more re-evaluations, i.e., consumes more reasoning time. This can also be identified in terms of constraints in the constraint base as well as the number of re-evaluations. Jenkins (\ref{jenkinsId}, id 5) appears to be a bit faster at lower re-evaluations, which is due to some disabled test cases.
-
-Regarding the total execution time, id 1 vs. 2 may appear as an outlier. However, we observed this behavior several times. Moreover, the difference between the reasoner versions is also evident on the Pi device in id 6 and 7. As besides pure reasoning experiments also IVML and VIL tests are executed, we attribute the significant increrase in overal experimentation time to the differences in the number/complexity of treatments, possibly also to changes in the IVML/VIL code base of EASy-Producer. It is important to note that complex and long-running VIL instantiations such as the code generation in the QualiMaster models are disabled by the experiment script. 
-
-A comparison of id 2 and id 3 suggests that an OpenJDK on Windows 10 behaves worse than Oracle JDK on a rather old Windows 7 installation in this experiment. This impression is confirmed by the average reasoning time in the repeated experiments (maximum reasoning time 336 ms for id 1 and 485 ms for id 3). OpenJDK 10 seems to be slightly faster (with an even smaller difference in the repeated experiments). Although we executed the experiments on Jenkins (id 5) in a separate build task prevening parallel builds, the overall execution time is almost twice as high as on a laptop. This is in contrast to the reasoning times, which are pretty similar to the other experiments for reasoner version 1.3.0, i.e., again reading the IVML model or running some VIL instantiations may cause the high overall execution time. Potential reasons may also be due to the setup, i.e., limited main memory size of 4 GBytes, an almost full virtual hard drive, virtualization overhead in particular for I/O operation when loading IVML and VIL models, etc.
-
-Based on \cite{KnocheEichelberger18}, we expected an average performance drop of factor 10 for the Pi experiments (id 6 and 7). \TBD{discuss}
-
-In summary, reasoner version 1.3.0 is more IVML complete and faster than reasoner version 1.1.0. However, we cannot confirm the huge differences reported in Section \ref{sectEvalSetupTreatments}. To identify the reason, we also executed the experiments in Eclipse (cf. Section \ref{sectEvalSetupProcedure}). When executing the ANT measurement script in terms of an external tool configuration, we did not notice significant timing differences. However, during development, one typically executes a jUnit test individually using the Eclipse jUnit test execution feature. Running the tests suites as JUnit tests in Eclipse doubled the average reasoning time for version 1.1.0 on \ref{oldLaptopId}. The reasoning time for individual models raised on \ref{oldLaptopId} even by more than factor 5. Moreover, working in a  mobile setting, i.e., executing the tests on an unplugged laptop, may have further increased the reasoning time leading to a wrong impression of the overall performance gain.
+Table \ref{tab:experimentsDescSummary} summarizes the results for the setups introduced in Table \ref{tab:experimentsCombinations}. For the initial run, we show the number of tests, the measured reasoning time, the number of created constraints, and the performed constraint evaluations. We do not show the minimum reasoning time, number of constraints or evaluations, respectively, as the minimum is usually 0 due to IVML test cases without constraints. As an indication of the overall time consumption of the experiment, we also list the total execution time for the 5 repeated executions of the test suites.
+
+At a glance, the number of tests, constraints and evaluations shall be the same for all treatments and might appear to be irrelevant here. However, as the treatments differ (cf. Section \ref{sectEvalSetupTreatments}), it is important to also indicate these numbers. In particular, in the pre-experiments these readings helped us identify minor bugs in the experiments that accidentally caused the evaluation of the same rather than different models in the extended QualiMaster cases. In more detail, the \emph{treatments differ} between the reasoner versions (id 1/2 as well as 6/7) as not all tests could be ported back. Moreover, the number of tests on Jenkins (\ref{jenkinsId}, id 5) differs, as some tests involving VIL are disabled there due to technical and memory issues. Except for the Pi \ref{piId}, the average reasoning time is rather similar across the devices and JDKs. However, according to the maximum reasoning time, reasoner version 1.3.0 operates faster than version 1.1.0. It is worth mentioning that reasoner version 1.3.0 is both more IVML-complete and better tuned for performance, as discussed in Section \ref{sectPerformance}. Moreover, reasoner version 1.1.0 even creates significantly more (including accidentally ineffective) constraints and, thus, performs more re-evaluations, i.e., consumes more reasoning time. This can also be identified in terms of constraints in the constraint base as well as the number of re-evaluations. Jenkins (\ref{jenkinsId}, id 5) appears to be a bit faster at lower re-evaluations, which is due to some disabled test cases.
+
+Regarding the \emph{total execution time}, id 1 vs.~2 may appear as an outlier. However, we observed this behavior several times. Moreover, the difference between the reasoner versions is also evident on the Pi device in id 6 and 7. As, besides the pure reasoning experiments, also IVML and VIL tests are executed, we attribute the significant increase in overall experimentation time to the differences in the number/complexity of treatments, possibly also to changes in the IVML/VIL code base of EASy-Producer. It is important to note that complex and long-running VIL instantiations, such as the code generation in the QualiMaster models, are disabled by the experiment script.
+
+A comparison of id 2 and id 3 suggests that \emph{OpenJDK} on Windows 10 behaves worse than \emph{Oracle JDK} on a rather old Windows 7 installation in this experiment. This impression is confirmed by the average reasoning time in the repeated experiments (maximum reasoning time 336 ms for id 1 and 485 ms for id 3). OpenJDK 10 seems to be slightly faster (with an even smaller difference in the repeated experiments). Although we executed the experiments on Jenkins (id 5) in a separate build task preventing parallel builds, the overall execution time is almost twice as high as on a laptop. This is in contrast to the reasoning times, which are pretty similar to the other experiments for reasoner version 1.3.0, i.e., again reading the IVML model or running some VIL instantiations may cause the high overall execution time. Potential reasons may also be due to the setup, i.e., the limited main memory size of 4 GB, an almost full virtual hard drive, virtualization overhead in particular for I/O operations when loading IVML and VIL models, etc.
+
+Based on the experiments in~\cite{KnocheEichelberger18}, we expected for the \emph{Pi experiments} (id 6 and 7) an average performance drop of about factor 10, in particular as the main resource usage here is CPU and memory rather than I/O. Reasoner version 1.1.0 (id 6) is actually more than factor 34 slower in comparison with id 1. Reasoner version 1.3.0 (id 7 vs.~2) is closer to the expectation, as the Pi execution in id 7 is factor 12 slower than id 3 (and factor 16-20 slower than id 2). There are also slight deviations in the maximum number of evaluations (id 1 vs.~id 6 and id 2 vs.~id 7), for which we do not have an explanation right now. In the context of this discussion, the rather high total execution times fit to a slowdown of factor 12-14. As probably I/O or swapping is the main cause, utilizing a faster SD card or an external hard drive as discussed in~\cite{KnocheEichelberger18} may speed up the experiment significantly. However, the main focus of this experiment is on the behavior of the reasoner, in particular the reasoning time. Here, reasoner version 1.3.0 (id 6 vs.~7) performs significantly better on a Pi than version 1.1.0 (factor 3-5).
+
+We conclude that reasoner version 1.3.0 is by construction more IVML-complete, creates and evaluates fewer constraints and, in summary, performs faster than reasoner version 1.1.0. However, we cannot confirm the huge differences reported in Section \ref{sectEvalSetupTreatments}. To identify the reason for this discrepancy, we also executed the experiments in Eclipse (cf. Section \ref{sectEvalSetupProcedure} for version details). When executing the ANT measurement script as an external tool configuration, we did not notice significant timing differences. However, during development, one typically executes a jUnit test individually using the Eclipse jUnit execution feature. Running the test suites using that feature, we observed an increase of the average reasoning time by factor 2 for version 1.1.0 on laptop \ref{oldLaptopId}, while the reasoning time for individual models rose even by more than factor 5. Moreover, working in a mobile setting, i.e., executing the tests on an unplugged laptop, may have further increased the reasoning time, leading to a wrong impression of the overall performance gain.
 
 \subsubsection{Individual Setups}
 
-Four groups: reasoning core tests, scenario tests, incremental reasoning, runtime/instance reasoning
-
-Figure \ref{fig:old-1_10-jdk8}:
-\begin{itemize}
-  \item Fluctuations due To Windows, UI?
-  \item Translation/evaluation time low, less constraints
-  \item Sum of translation/evaluation time does not lead to full reasoning time. Additional activities, see Section \ref{sectPerformance} in particular result handling.
-\end{itemize}
-
-Figure \ref{fig:old-1_30-jdk8}:
-\begin{itemize}
-    \item requires less time
-    \item produces less constraints, less re-evaluations (depends on model)
-    \item incremental slightly faster
-    \item instance: memory allocation in reasoning time, no translation/evaluation time
-    \item all reasoning/evaluation 1:1, sum up to reasoning time
-\end{itemize}
-
-Figure \ref{fig:new-1_30-ojdk8}:
-\begin{itemize}
-    \item new computer, windows, OpenJDK 8 slower
-    \item less big spikes (different installation)
-\end{itemize}
-
-Figure \ref{fig:new-1_30-ojdk10}:
-\begin{itemize}
-    \item even slower, but even less spikes
-\end{itemize}
-
-Figure \ref{fig:jenkins-1_30-ojdk10}:
-\begin{itemize}
-    \item \textbf{data missing?}
-\end{itemize}
-
-Figure \ref{fig:pi-1_10-jdk8}:
-\begin{itemize}
-    \item factor 60 slower
-    \item evaluation speed
-    \item no time difference for larger models
-    \item as version 1.1.0, 200+200 vs 30.000? output problems
-    \item \textbf{data missing?}
-\end{itemize}
-
-Figure \ref{fig:pi-1_30-jdk8}:
-\begin{itemize}
-    \item less spiky than windows and jenkins
-    \item no time difference for larger models
-    \item factor 13 slower - ICPE paper)
-    \item times sum up
-    \item \textbf{data missing?}
-\end{itemize}
+We now discuss the detailed results for the individual setups and experiments. We present diagrams of reasoning, translation and evaluation time obtained from the repeated experiment runs. The diagrams indicate the 95\% confidence interval of the measurements of each test case as error bars.
 
 \newcommand\evalPDFscale[0]{0.67}
@@ -310,5 +285,5 @@
   \hfill
   \includegraphics[scale=\evalPDFscale]{figures/benchmark-results-old-1_1_0-jdk1_8-20190308_evaluationTime_all.pdf}
-  \caption{Reasoner 1.10 on the Dell Latitude 6430u using Oracle JDK 1.8.}\label{fig:old-1_10-jdk8}
+  \caption{Reasoner 1.1.0 on \ref{oldLaptopId} (Dell Latitude 6430u) using Oracle JDK 1.8.}\label{fig:old-1_10-jdk8}
 \end{figure}
 
@@ -322,8 +297,12 @@
   \hfill
   \includegraphics[scale=\evalPDFscale]{figures/benchmark-results-old-1_3_0-jdk1_8-20190308_evaluationTime_all.pdf}
-  \caption{Reasoner 1.30 on the Dell Latitude 6430u using Oracle JDK 1.8.}\label{fig:old-1_30-jdk8}
-\end{figure}
-
-\clearpage
+  \caption{Reasoner 1.3.0 on \ref{oldLaptopId} (Dell Latitude 6430u) using Oracle JDK 1.8.}\label{fig:old-1_30-jdk8}
+\end{figure}
+
+\clearpage
+
+In the diagrams, we visually separate two main groups of test cases, namely reasoning core tests and scenario tests, the latter including the tests for the ScaleLog and QualiMaster models. Both groups are sub-divided into full reasoning, incremental reasoning (partial constraint base) and runtime reasoning (re-used partial constraint base). For each group, we indicate a (desirable) linear regression of the measurements (dotted lines) with a surrounding 95\% confidence region. Depending on the reasoner version, the treatment sets differ (cf. Section \ref{sectEvalSetupTreatments}), e.g., no tests with a re-used partial constraint base are available for reasoner version 1.1.0, as discussed in Sections \ref{sectEvalSetupSubjects} and \ref{sectEvalSetupTreatments}.
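The per-group regression lines in the diagrams can, in principle, be reproduced from the raw measurements by ordinary least squares; the following is a minimal sketch (hypothetical function name, not the tooling actually used to produce the figures):

```python
def ols_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x, e.g., reasoning time
    (ys) over test case index or model size (xs)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope
```

A small residual around such a fitted line corresponds to the desirable linear scaling of reasoning time discussed for the figures.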
+
+Figures \ref{fig:old-1_10-jdk8} and \ref{fig:old-1_30-jdk8} illustrate the measurements for the two reasoner versions on \emph{laptop \ref{oldLaptopId}} (experiment id 1 and 2). While the scale of the obtained measurements for the \textit{full reasoning} time is roughly the same for both versions (reasoner version 1.3.0 is faster as discussed in Section \ref{sectEvaluationResultsDescriptive}), the scales for translation and evaluation time differ. In other words, for reasoner version 1.3.0, translation and evaluation time sum up to the reasoning time as expected, while this does not hold for reasoner version 1.1.0. We attribute this (as also evident in the code) to additional operations performed before, between or after translation/evaluation, e.g., the unnecessary creation of configuration instances, non-incremental and superfluous freezing of IVML variables, or output that could be deferred until after reasoning (cf. Section \ref{sectPerformance}). While version 1.1.0 may appear to be faster in translation and evaluation time, it is important to recall that the translation algorithms of version 1.1.0 are not IVML-complete. Moreover, version 1.3.0 behaves more linearly, i.e., measured times, in particular reasoning time, deviate less from the desired linear behavior, while version 1.1.0 in Figure \ref{fig:old-1_10-jdk8} tends towards quadratic behavior. \textit{Incremental reasoning} for version 1.3.0 is faster than for version 1.1.0, although again, translation and evaluation in detail are slower due to a higher degree of IVML-completeness. \textit{Re-using the constraint base} by instantiating the reasoner for a certain IVML model (here just changing a few configuration variables) does not seem to allocate (much) time in translation and evaluation. In contrast, regarding total reasoning time, this form of runtime reasoning requires 20-50 ms, which we attribute to the transfer of constraints between the stored and the actual constraint base.
 
 \begin{figure}[!htb]
@@ -334,5 +313,5 @@
   \hfill
   \includegraphics[scale=\evalPDFscale]{figures/benchmark-results-new-1_3_0-ojdk1_8-20190308_evaluationTime_all.pdf}
-  \caption{Reasoner 1.30 on the Dell Latitude 7490 using Open JDK 1.8.}\label{fig:new-1_30-ojdk8}
+  \caption{Reasoner 1.3.0 on \ref{newLaptopId} (Dell Latitude 7490) using Open JDK 1.8.}\label{fig:new-1_30-ojdk8}
 \end{figure}
 
@@ -346,8 +325,16 @@
   \hfill
   \includegraphics[scale=\evalPDFscale]{figures/benchmark-results-new-1_3_0-ojdk10-20190308_evaluationTime_all.pdf}
-  \caption{Reasoner 1.30 on the Dell Latitude 7490 using Open JDK 10.}\label{fig:new-1_30-ojdk10}
-\end{figure}
-
-\clearpage
+  \caption{Reasoner 1.3.0 on \ref{newLaptopId} (Dell Latitude 7490) using Open JDK 10.}\label{fig:new-1_30-ojdk10}
+\end{figure}
+
+\clearpage
+
+Figures \ref{fig:new-1_30-ojdk8} and \ref{fig:new-1_30-ojdk10} illustrate the measurements for reasoner version 1.3.0 with the two JDKs on \emph{laptop \ref{newLaptopId}} (experiment ids 3 and 4). Except for the slower total performance on \ref{newLaptopId}, version 1.3.0 behaves rather similarly to version 1.3.0 on \ref{oldLaptopId}. Surprisingly, the deviations of the measurements for the two devices differ significantly. It appears that the combination of Windows 10 and OpenJDK allows for more stable and more repeatable measurements than Windows 7 and Oracle JDK. This impression may be deceptive for several reasons, e.g., because the figures illustrate the results of the repeated experiments or because the installed software on the devices differs.
+\begin{itemize}
+  \item The initial runs for \ref{newLaptopId} show somewhat larger deviations than indicated in Figures \ref{fig:new-1_30-ojdk8} and \ref{fig:new-1_30-ojdk10}, but only for a few individual test cases, in particular the largest QualiMaster model. The initial runs on \ref{oldLaptopId} show deviations similar to those in Figures \ref{fig:old-1_10-jdk8} and \ref{fig:old-1_30-jdk8}, in particular for (the incremental reasoning on) the largest QualiMaster model. Here, the repeated setup with many more stable runs may have reduced the width of the 95\% confidence intervals by excluding a few extreme values.
+  \item Windows 7 on \ref{oldLaptopId} is outdated and automatic updates no longer work; more precisely, Windows patches are installed, configured and deactivated in long boot loops, leaving more and more patches uninstalled. Furthermore, some software on \ref{oldLaptopId} was identified as irrelevant for \ref{newLaptopId}, and different or more recent software was installed on \ref{newLaptopId}. Thus, it is difficult, if not impossible, to attribute the reduced deviations to specific root causes. However, in contrast to many research papers that focus on Linux for stability, a more detailed analysis of the stability of Windows might lead to interesting results.
+\end{itemize}
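To illustrate how excluding a few extreme values narrows the confidence intervals, the following sketch computes the half-width of a 95\% confidence interval of the mean using the normal approximation (z = 1.96). All sample values are made up; they merely mimic a run set with and without one extreme initial run:

```java
import java.util.Arrays;

public class ConfidenceInterval {

    // half-width of the 95% confidence interval of the mean,
    // normal approximation (z = 1.96), sample standard deviation
    static double ci95HalfWidth(double[] samples) {
        double mean = Arrays.stream(samples).average().orElse(0);
        double var = Arrays.stream(samples)
                .map(x -> (x - mean) * (x - mean))
                .sum() / (samples.length - 1);
        return 1.96 * Math.sqrt(var / samples.length);
    }

    public static void main(String[] args) {
        // made-up runtimes in ms: one extreme initial run vs. stable runs only
        double[] withOutlier = {100, 102, 98, 101, 180};
        double[] stableOnly  = {100, 102, 98, 101, 99};
        System.out.printf("with outlier: +/- %.1f ms%n", ci95HalfWidth(withOutlier));
        System.out.printf("stable only:  +/- %.1f ms%n", ci95HalfWidth(stableOnly));
    }
}
```

A single extreme value inflates both the mean and the interval width, so repeating the experiment with many stable runs (and thereby diluting or excluding such values) visibly shrinks the plotted intervals.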
+
+In summary, although \ref{newLaptopId} is the more modern machine and has twice as much memory as \ref{oldLaptopId}, it does not appear to be significantly faster. While we would also expect the SSD of \ref{newLaptopId} to be faster than the one of \ref{oldLaptopId}, there is no real speedup of the overall experiment time; actually, \ref{newLaptopId} is slower, as shown in Table \ref{tab:experimentsDescSummary}. However, the real difference between \ref{newLaptopId} and \ref{oldLaptopId} is the number of physical cores, which are not exploited by the reasoner. Executing a virtualized computer while interacting with the host system is much smoother on \ref{newLaptopId} than on \ref{oldLaptopId}. Thus, reasoning on multiple models in parallel (processes) may be faster on \ref{newLaptopId}, but this is currently out of scope.
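Although out of scope here, exploiting the additional cores by reasoning on multiple models in parallel could be sketched as follows. The `reasonOn` method is a hypothetical stand-in for running the reasoner on one model; the actual reasoner API differs:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelReasoning {

    // hypothetical stand-in for running the reasoner on one IVML model
    static String reasonOn(String model) {
        return model + ": ok";
    }

    public static void main(String[] args) throws Exception {
        List<String> models = List.of("modelA", "modelB", "modelC");
        // one worker thread per physical/logical core reported by the JVM
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try {
            // run one independent reasoning task per model
            List<Future<String>> results = pool.invokeAll(
                    models.stream()
                          .map(m -> (Callable<String>) () -> reasonOn(m))
                          .toList());
            for (Future<String> f : results) {
                System.out.println(f.get());
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Since the individual reasoning tasks are independent, such a setup could utilize the additional cores of \ref{newLaptopId}; whether the shared model infrastructure of the reasoner is thread-safe enough for this is an open question.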
 
 \begin{figure}[!htb]
@@ -358,8 +345,12 @@
   \hfill
   \includegraphics[scale=\evalPDFscale]{figures/benchmark-results-jenkins-1_3_0-jdk1_8-20190308_evaluationTime_all.pdf}
-  \caption{Reasoner 1.30 on Jenkins Ubuntu Open JDK 1.8.}\label{fig:jenkins-1_30-ojdk10}
-\end{figure}
-
-\clearpage
+  \caption{Reasoner 1.3.0 on \ref{jenkinsId} (Jenkins Ubuntu) using Open JDK 1.8.}\label{fig:jenkins-1_30-ojdk10}
+\end{figure}
+
+\clearpage
+
+Figure \ref{fig:jenkins-1_30-ojdk10} depicts the results for the \emph{continuous integration machine \ref{jenkinsId}} (experiment id 5). As expected due to the descriptive statistics in Table \ref{tab:experimentsDescSummary}, the obtained time readings are similar to the already discussed figures for version 1.3.0 (Figures \ref{fig:old-1_30-jdk8}, \ref{fig:new-1_30-ojdk8} and \ref{fig:new-1_30-ojdk10}). However, the results also indicate significantly more and larger deviations. It is important to recall here that we configured the build task for this experiment in a way that other build tasks are not executed in parallel on the same machine. Thus, we believe that the background processes of a continuous integration machine are responsible for these deviations, e.g., regularly checking for new commits for more than 80 build tasks, running an Apache webserver as frontend and Jenkins as backend, regularly cleaning local repositories, etc.
+
+Except for the absolute runtimes, i.e., the scales of the diagrams, the readings for the \emph{Pi 3} (\ref{piId}, experiment ids 6 and 7) should be rather similar to Figures \ref{fig:old-1_10-jdk8} and \ref{fig:old-1_30-jdk8}. Indeed, similar to Figure \ref{fig:old-1_10-jdk8}, Figure \ref{fig:pi-1_10-jdk8} illustrates in an even more drastic way that translation and evaluation time do not sum up to the reasoning time (experiment id 1 vs.~6). Both translation and evaluation time are roughly a factor of 10 slower than on device \ref{oldLaptopId} and, thus, fit rather well with the expected slowdown by a factor of 10 for CPU-intensive tasks on a Pi 3. Due to different algorithms and optimizations, version 1.3.0 in Figure \ref{fig:pi-1_30-jdk8} is rather similar to Figure \ref{fig:old-1_30-jdk8} (experiment id 2 vs.~7), but on a different scale as already discussed in Section \ref{sectEvaluationResultsDescriptive}. The interesting aspect is that version 1.1.0 on \ref{piId} (experiment id 1 vs.~6) seems to have fewer deviations (except for one particular treatment), while version 1.3.0 has as few deviations as version 1.3.0 on \ref{newLaptopId} (experiment id 4 vs.~7), in contrast to the respective results on \ref{oldLaptopId} (experiment id 1 vs.~7). We expect such a result for an isolated device such as a Pi 3~\cite{KnocheEichelberger18}, but, as stated above, not for Windows, which makes a detailed investigation of this phenomenon more appealing.
 
 \begin{figure}[!htb]
@@ -370,5 +361,5 @@
   \hfill
   \includegraphics[scale=\evalPDFscale]{figures/benchmark-results-pi-1_1_0-jdk1_8-20190308_evaluationTime_all.pdf}
-  \caption{Reasoner 1.10 on a Pi 3 using Oracle JDK 1.8.}\label{fig:pi-1_10-jdk8}
+  \caption{Reasoner 1.1.0 on \ref{piId} (Pi 3) using Oracle JDK 1.8.}\label{fig:pi-1_10-jdk8}
 \end{figure}
 
@@ -382,18 +373,8 @@
   \hfill
   \includegraphics[scale=\evalPDFscale]{figures/benchmark-results-pi-1_3_0-jdk1_8-20190308_evaluationTime_all.pdf}
-  \caption{Reasoner 1.30 on a Pi 3 using Oracle JDK 1.8. \TBD{still old}}\label{fig:pi-1_30-jdk8}
-\end{figure}
-
-\clearpage
-
-
-\TBD{Potential topics:
-
-\begin{itemize}
-    \item Artificial models of some size/variable/constraint ratio. Christian had a generator for that and Roman typically did some evaluations using these models. Compare to known results where available.
-    \item QualiMaster full, incremental, runtime reasoning (full, no defaults, no frozen, reuse constraint base). Runtime vs. full reasoning seems to behave rather linearly, at around 23 constraints evaluated per ms. Compare to known results where available.
-    \item All test cases involving reasoning as an overview map.
-    \item Eclipse 4.7/Java9 runs tests with asserts!!!
-\end{itemize}
-
-}
+  \caption{Reasoner 1.3.0 on \ref{piId} (Pi 3) using Oracle JDK 1.8.}\label{fig:pi-1_30-jdk8}
+\end{figure}
+
+\clearpage
+
+In summary, we conclude that reasoner version 1.3.0 is faster than version 1.1.0 while providing a higher degree of IVML-completeness and additional features such as re-using the constraint base. It also seems that version 1.3.0 leads to fewer spikes, which may, however, also be due to the underlying setup. While the achieved runtime for full reasoning on ``usual'' devices such as \ref{oldLaptopId}, \ref{newLaptopId} or \ref{jenkinsId} is sufficient even for runtime adaptation scenarios~\cite{Eichelberger18}, applying EASy-Producer and the IVML reasoner in embedded settings is possible, but reasoning over large models in soft realtime settings is problematic (full reasoning up to 6~s, incremental reasoning about 2~s and instance-based reasoning up to 1~s). In this case, additional performance optimizations of the reasoner as well as of the underlying model infrastructure, as sketched in this report, are required.
