Index: /reasoner/evaluation.tex
===================================================================
--- /reasoner/evaluation.tex	(revision 259)
+++ /reasoner/evaluation.tex	(revision 260)
@@ -35,5 +35,5 @@
 \subsubsection{Data Collection}\label{sectEvalSetupDataCollection}
 
-In the test cases mentioned above, we employ the generic measurement data collector of EASy-Producer. The collector stores key-value pairs representing measured (real) values. A measurement  carries additional information to identify the test cases, including a tag (used to indicate the reasoning mode), the name of the involved IVML model, the name of the executing test case as well as the intra-experiment repetition count, i.e., the repetion of the same reasoning run within the test case. By default, the collector automatically accounts for (wall) response time, which can help validating more detailed time measures collected during the execution of a treatment. Collected values are stored along with the additional information when the data collection for a test case is finished. As default storage format we use tab-separated values, i.e., a column-table format, as this can easily be read by statistic frameworks such as R as well as Microsoft Excel. We use Excel in particular for gaining a quick overview of the colllected information.
+In the test cases mentioned above, we employ the generic measurement data collector of EASy-Producer. The collector stores key-value pairs representing measured (real) values. A measurement carries additional information to identify the test cases, including a tag (used to indicate the reasoning mode), the name of the involved IVML model, the name of the executing test case as well as the intra-experiment repetition count, i.e., the repetition of the same reasoning run within the test case. By default, the collector automatically accounts for (wall) response time, which can help validate more detailed time measures collected during the execution of a treatment. Collected values are stored along with the additional information when the data collection for a test case is finished. As default storage format we use tab-separated values, i.e., a column-table format, as this can easily be read by statistics frameworks such as R as well as by Microsoft Excel. We use Excel in particular for gaining a quick overview of the collected information.
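The collector's storage scheme described above can be sketched as follows. This is a minimal illustration in Python, not the actual Java implementation of EASy-Producer; the column names and the example values are hypothetical.

```python
import csv
import io

# Hypothetical sketch of the collector's tab-separated output: one row per
# measured value, identified by the tag (reasoning mode), the IVML model,
# the executing test case and the intra-experiment repetition count.
COLUMNS = ["tag", "model", "testCase", "repetition", "key", "value"]

def write_measurements(out, measurements):
    """Store key-value measurements along with their identifying information."""
    writer = csv.writer(out, delimiter="\t")
    writer.writerow(COLUMNS)
    for m in measurements:
        writer.writerow([m[c] for c in COLUMNS])

buf = io.StringIO()
write_measurements(buf, [
    {"tag": "EVAL", "model": "QM", "testCase": "testReasoning",
     "repetition": 0, "key": "responseTime", "value": 16.4},
])
print(buf.getvalue())
```

A column-table format like this loads directly into R (`read.delim`) or Excel, which is the stated reason for choosing it.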
 
 As specific measurements for this evaluation, we include 
@@ -49,17 +49,18 @@
 During the experiment, we execute each of the test suites in Section \ref{sectEvalSetupTreatments}. Due to their use in continuous integration, each test suite runs in its own JVM instance. To compensate for delayed just-in-time (JIT) compilation, we include a specific ramp-up test in the experimental runs to warm up the JVM. For most test cases, reasoning over a simple representative model including a compound type, a collection over that type and a quantifier constraint over the container variable seems to be sufficient. For the QualiMaster models, we add a full run of one of the largest models as ramp-up without accounting for reasoning time. If the test cases include artifact instantiation through VIL, we disable the instantiation phase.
 
-However, pilot experiments showed that still significant differences between the first runs of a test case and subsequent runs may occur. Thus, within each suite, we repeat the reasoning functionality of each test case 10 times on a fresh configuration. These 10 (here intuitively chosen) repetitions make up the intra-experiement repetitions. In particular, the repetitions allow for (later) excluding  warmup runs as well as for basic descriptive statistics such as confidence intervals \cite{GeorgesBuytaertEeckhout07}. 
+However, pilot experiments showed that significant differences between the first runs of a test case and subsequent runs may still occur. Thus, within each suite, we repeat the reasoning functionality of each test case 10 times on a fresh configuration. These 10 (here intuitively chosen) repetitions make up the intra-experiment repetitions. In particular, the repetitions allow for (later) excluding warmup runs as well as for basic descriptive statistics such as confidence intervals \cite{GeorgesBuytaertEeckhout07}. 
 
 On a given machine/device, we first perform an initial run and then the experimental runs. During the initial run, we execute the test suites on the target device to validate that all tests pass successfully. The measurements of this run are stored separately. For taking the experimental measures, we repeat the execution of the test suites 5 times (inter-experiment repetition) with 5 seconds pause between two subsequent runs.
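The repetition scheme above (10 intra-experiment repetitions per test case, 5 inter-experiment suite runs with a 5-second pause in between) can be sketched as follows. This is a hypothetical Python sketch, not the actual ANT/JUnit machinery; the `reason` callback stands in for a reasoning run.

```python
import time

INTRA_REPETITIONS = 10   # repetitions of the reasoning run within a test case
INTER_REPETITIONS = 5    # repetitions of the whole test suite execution
PAUSE_SECONDS = 5        # pause between two subsequent suite runs

def run_experiment(test_cases, reason, pause=time.sleep):
    """Return measurements as a list (one entry per suite run) of dicts
    mapping each test case to its 10 readings."""
    results = []
    for run in range(INTER_REPETITIONS):
        suite_results = {}
        for test_case in test_cases:
            # fresh configuration per repetition; all 10 readings are kept so
            # that warmup runs can be excluded later during the analysis
            suite_results[test_case] = [reason(test_case)
                                        for _ in range(INTRA_REPETITIONS)]
        results.append(suite_results)
        if run < INTER_REPETITIONS - 1:
            pause(PAUSE_SECONDS)
    return results

# usage with a dummy "reasoner" and no real sleeping
data = run_experiment(["t1", "t2"], reason=lambda tc: 1.0, pause=lambda s: None)
```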
 
-Executing the test suites is realized as an ANT script, because collecting all dependencies for the standalone version of EASy-Producer is not trivial. For this task, we reuse the respective part of the ANT build mechanism from the continuous integration. ANT supports executing jUnit test suites based on a classpath constructed from the dependencies, so executing the test suites described in Section \ref{sectEvalSetupTreatments} is rather straightforward. The inclusion of the ramp-up tests, the intra-experiment and inter-experiment repetitions as well as the waiting time are configured in the ANT script and passed to the EASy-Producer test suites via environment parameters. For separating the intial and the experimental runs, the ANT script defines two specific tasks. In turn, the ANT script itself can be used as build action in the continuous integration, i.e., for collecting performance readings on the continuous integration server, e.g., for detecting performance degradation.
-
-We execute the ANT script, i.e., the experiments, on 
-\begin{itemize}
-  \item an actual office computer: Dell Latitude 7490 laptop with Intel core i7-8650 U processor (4 physical cores at 1.9 GHz) , 32 GBytes RAM with Windows 10 Professional (10.0.17134) and Open JDK 10.0.2 64 bit. Besides Maven 3.2.3, ANT 1.10.3 with Maven-ANT-tasks 2.1.3, an Eclipse Oxygen 3a release 4.7.3a are installed. 
-  \item a retired office computer used for the reason reveisions: Dell Latitude 6430u laptop with Intel core i7-3367U (2 physical cores at 2.00GHz), Windows 7 version 6.1.761 SP 1 and Oracle JDK 1.8.0\_66 64 bit. Besides Maven 3.2.3, ANT 1.10.3 with Maven-ANT-tasks 2.1.3, an Eclipse Mars2 release 4.5.2 are installed.
-  \item our continuous integration server, a Ubuntu Linux 16.4.5 LTS VM with 4 GBytes RAM and OpenJDK 8 64bit. The measurement script is integrated as a manual build action that prevents any other continous build action running at the same time.
-  \item a Raspberry Pi 3 by vendor element14 hosting an 8 GB SanDisk class-4 SD (one from \cite{KnocheEichelberger18}) with Raspbian Stretch Lite version November 2018, Linux Kernel 4.14 and Oracle JDK 1.8.\TBD{65} ARM and, as alternative for some experiments, Oracle JDK 1.8.0\_201 ARM.
-\end{itemize}
+Executing the test suites is realized as an ANT script, because collecting all dependencies for the standalone version of EASy-Producer is not trivial. For this task, we reuse the respective part of the ANT build mechanism from the continuous integration. ANT supports executing JUnit test suites based on a classpath constructed from the dependencies, so executing the test suites described in Section \ref{sectEvalSetupTreatments} is rather straightforward. The inclusion of the ramp-up tests, the intra-experiment and inter-experiment repetitions as well as the waiting time are configured in the ANT script and passed to the EASy-Producer test suites via environment parameters. For separating the initial and the experimental runs, the ANT script defines two specific tasks. In turn, the ANT script itself can be used as a build action in the continuous integration, i.e., for collecting performance readings on the continuous integration server, e.g., for detecting performance degradation.
+
+We execute the experiment on the following devices:
+\begin{enumerate}
+  \renewcommand{\theenumi}{D\arabic{enumi}}
+  \item\label{oldLaptopId} Retired office computer used for developing reasoner v1.3.0: Dell Latitude 6430u laptop with Intel Core i7-3367U (2 physical cores at 2.00 GHz), Windows 7 version 6.1.761 SP 1 and Oracle JDK 1.8.0\_66 64 bit. In addition, Maven 3.2.3, ANT 1.10.3 with Maven-ANT-tasks 2.1.3, and Eclipse Mars.2 (release 4.5.2) are installed.
+  \item\label{newLaptopId} Actual office laptop: Dell Latitude 7490 with Intel Core i7-8650U processor (4 physical cores at 1.9 GHz), 32 GBytes RAM with Windows 10 Professional (10.0.17134) and OpenJDK 10.0.2 64 bit. In addition, Maven 3.2.3, ANT 1.10.3 with Maven-ANT-tasks 2.1.3, and Eclipse Oxygen.3a (release 4.7.3a) are installed.
+  \item\label{jenkinsId} Our continuous integration server, an Ubuntu Linux 16.04.5 LTS VM with 4 GBytes RAM and OpenJDK 8 64 bit. The measurement script is integrated as a manual build action that prevents any other continuous build action from running at the same time.
+  \item\label{piId} A Raspberry Pi 3 by vendor element14 hosting an 8 GB SanDisk class-4 SD card (a device used in \cite{KnocheEichelberger18}) with Raspbian Stretch Lite (November 2018 version), Linux Kernel 4.14 and Oracle JDK 1.8.\TBD{65} ARM and, as an alternative for some experiments, Oracle JDK 1.8.0\_201 ARM.
+\end{enumerate}
 For both Windows machines, we terminate all programs that are not required for the execution of the tests (leaving the virus scanner in operation as usual during development). 
 
@@ -68,5 +69,5 @@
 After collecting the runtime measurements, we execute an R script which combines the results of all test executions for one complete run, calculates statistical measures and draws summarizing diagrams. To analyze the initial runs, the script reads the individual files created for the test suites from Section \ref{sectEvalSetupTreatments}, removes the first three intra-experiment repetition runs per test case, calculates average, variance and confidence interval per test case (cf. \cite{BulejHorkTuma17} for validity) and joins all results into one large table. Although the inter-experiment repetitions can be seen as individual experiments, they are executed in direct sequence on the same machine. Thus, we also follow \cite{BulejHorkTuma17} here and create in this case a statistical summary table over all runs, resulting in a similarly large table.
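The per-test-case statistics described above can be illustrated as follows. This is a Python sketch assuming a normal-approximation confidence interval for the mean; the actual analysis is an R script and may differ in detail, and the example readings are made up.

```python
import math
import statistics

WARMUP_RUNS = 3  # first intra-experiment repetitions removed before analysis

def summarize(readings, z=1.96):
    """Mean, sample variance and approximate 95% confidence interval of the
    mean, computed after removing the warmup runs."""
    steady = readings[WARMUP_RUNS:]
    mean = statistics.mean(steady)
    var = statistics.variance(steady)          # sample variance
    half = z * math.sqrt(var / len(steady))    # normal approximation
    return {"mean": mean, "variance": var, "ci": (mean - half, mean + half)}

# the first three (warmup) readings are visibly slower and get discarded
s = summarize([50.0, 30.0, 20.0, 12.0, 11.0, 13.0, 12.0, 11.0, 13.0, 12.0])
```

For the small sample sizes used here (7 readings after warmup removal), a t-distribution quantile instead of `z=1.96` would be the more careful choice, as discussed in the benchmarking literature cited above.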
 
-Finally, zhe script produces various plots, all relating runtime (reasoning time, translation time or evaluation time) and model complexity (cf. Section \ref{sectModelComplexity}). These plots are inteded to provide an overview on the reasoning time for different modes, i.e., to discuss how the reasoning time is composed and some indicating the deviations across all series of test suite execution.
+Finally, the script produces various plots, all relating runtime (reasoning time, translation time or evaluation time) to model complexity (cf. Section \ref{sectModelComplexity}). These plots are intended to provide an overview of the reasoning time for the different modes, i.e., to show how the reasoning time is composed; some also indicate the deviations across all series of test suite executions.
 
 \subsection{Model Ranking and Complexity}\label{sectModelComplexity}
@@ -202,21 +203,26 @@
 \subsection{Results}\label{sectEvaluationResults}
 
+\newcommand\centerCell[1]{\multicolumn{1}{|c|}{#1}}
 \begin{table*}[ht]
 %\begin{adjustbox}{angle=90}
 \centering
-\begin{tabular}{|l|l|l|l|l|l|l|}
+\begin{tabular}{|c|c|c|r|r|r|r|r|r|r|}
 \hline
-\textbf{Device} &  \textbf{ver} & \textbf{JDK} & \textbf{tests} & \textbf{reasoning [ms]} & \textbf{Constraints} & \textbf{Re-evaluations}\\
+                          &                       &                       & \textbf{tests}                    & \multicolumn{2}{|c|}{\textbf{reasoning}} & \multicolumn{2}{|c|}{\textbf{constraints}} & \multicolumn{2}{|c|}{\textbf{evaluations}} \\
+\textbf{device} & \textbf{ver}  & \textbf{JDK} & \centerCell{\textbf{[\#]}} & \multicolumn{2}{|c|}{\textbf{time [ms]}} & \multicolumn{2}{|c|}{\textbf{[\#]}} &
+\multicolumn{2}{|c|}{\textbf{[\#]}}\\
+                          &                       &                       &                                            & \textbf{avg} & \textbf{max}                     & \textbf{avg} & \textbf{max} & 
+\textbf{avg} & \textbf{max}\\
 \hline
-Latitude 6430u & 1.1.0 & Oracle 1.8 & 400 & 16.4 / 655.1 & 1714.7 / 24110 & 1092.5 / 46469.1 \\
-Latitude 6430u & 1.3.0 & Oracle 1.8 & 433 & 11.8 / 399.1 & 52.6 / 6000 & 603.3 / 23696\\
-Latitude 7490   & 1.3.0 & Open 1.8 & 433 & 13.4 / 424.0 & 52.6 / 6000 & 603.3 / 23696\\
-Latitude 7490   & 1.3.0 & Open 10 & 433 & 14.9 / 484.4 & 52.6 / 6000 & 603.3 / 23696\\
-Jenkins             & 1.3.0 &  Open 8 & 427 & 18.9 / 562.1 & 53.3 / 6000 & 709.3 / 23703\\
-Pi 3                   & 1.1.0 & Oracle 1.8 & 400 & 958.1 / 28651.7 & 977.6 / 24110 & 1679.0 / 46487.8\\
-Pi 3                   & 1.3.0 & Oracle 1.8 & 433 & 255.0 / 6117.3 & 52.6 / 6000 & 815.3 / 23708.3\\
+\ref{oldLaptopId}   & 1.1.0 & 1.8.0   & 400 & 16 & 655       & 723   & 24110   & 1092 & 46469 \\
+\ref{oldLaptopId}   & 1.3.0 & 1.8.0   & 433 & 11 & 399       & 53     & 6000     & 603   & 23696\\
+\ref{newLaptopId} & 1.3.0 & 1.8.0* & 433 & 13 & 424       & 53     & 6000     & 603   & 23696\\
+\ref{newLaptopId} & 1.3.0 & 10*      & 433 & 15  & 484      & 53     & 6000     & 603   & 23696\\
+\ref{jenkinsId}        & 1.3.0 & 1.8.0   & 427 & 19  & 562      & 53     & 6000     & 709   & 23703\\
+\ref{piId}                & 1.1.0 & 1.8.0   & 400 & 958 & 28651 & 978   & 24110   & 1679 & 46487\\
+\ref{piId}                & 1.3.0 & 1.8.0   & 433 & 255 & 6117   & 53     & 6000     & 815   & 23708\\
 \hline
 \end{tabular}
-\caption{Descriptive summary of experiments.}
+\caption{Descriptive summary of experiments (* = OpenJDK).}
 \label{tab:experimentsDescSummary}
 %\end{adjustbox}
@@ -227,4 +233,5 @@
 Table \ref{tab:experimentsDescSummary}: 
 \begin{itemize}
+  \item the minimum is always 0; the table reports averages of averages and, similarly, overall maxima, with fractional digits omitted
   \item reasoner 1.3.0 is faster and produces less constraints/re-evaluations although being more complete. 
   \item reasoner 1.1.0 produces ineffective and less complex constraints that can be evaluated faster
