Index: /reasoner/evaluation.tex
===================================================================
--- /reasoner/evaluation.tex	(revision 245)
+++ /reasoner/evaluation.tex	(revision 246)
@@ -16,5 +16,5 @@
 \begin{itemize}
   \item The original reasoner implementation that acted as basis for the revision. This base version\footnote{\label{reasonerBaseVersion}Git hash 6a00aa9c5aaa37ddb3d490d36c7e9a037e792656} is part of EASy-Producer release 1.1.0, i.e., we will call this original implementation \emph{reasoner v1.1.0}.
-  \item The revised reasoner implementation using the algorithms described in this document. This version\footnote{\label{reasonerActualVersion}Git hash e6cf7dcb850857cecfa4088434f0a717d16234e8. This version rather similar to EASy-Producer release 1.2.0, but includes some further improvements.} is part of EASy-Producer release 1.3.0, i.e., we will call it \emph{reasoner v1.3.0}.
+  \item The revised reasoner implementation using the algorithms described in this document. This version\footnote{\label{reasonerActualVersion}Git hash d2967af5385799f24c8fbcfacac0c396835f44ec. This version is rather similar to EASy-Producer release 1.2.0, but includes some further improvements.} is part of EASy-Producer release 1.3.0, i.e., we will call it \emph{reasoner v1.3.0}.
 \end{itemize}
 
@@ -29,17 +29,31 @@
 \item an extended set of scenario test cases derived from the largest QualiMaster models in the EASy-Producer scenario test cases. These tests were specifically defined for this experiment as the gap in model size between the QualiMaster models and the other models was too large. The QualiMaster model consists of several imported projects, various user-defined types, a topological structure for defining Big Data processing pipelines and 16 configured pipelines. This corresponds roughly to 20,000 individual variables. In the set of test cases, we created models with the same type definitions but systematically varied the number of pipelines, i.e., created projected models with one pipeline, two pipelines etc. All these models contain only the required linked variables such as algorithms or data sources so that the respective model is structurally valid and its configuration is consistent.
 \end{itemize}
-We use these test cases as experimental treatments, although this involves test dependencies such as jUnit. While some of the test cases rely on programmed models (in terms of the IVML object model), most of the test cases specify the model in terms of IVML, i.e., for a more realistic setup and require for execution the IVML parser as well as dependent Eclipse libraries. As we focus on the reasoning time, the actual creation of the reasoning model shall not affect the results. 
-
-It is important to note that the code for reasoner v1.1.0 does not include several test cases that have been created for v1.3.0. For this experiment, we enable for v1.1.0 as many test cases as possible, i.e., we patch back\footnote{\label{fnPatch}Patch is available from \MISSING{XXX}.} the v1.3.0 test cases into v1.1.0 and, if required, either adjust the expected test result accordingly or, in extreme cases, disable test cases that cannot be handled by the v1.1.0 reasoner (or the underlying IVML implementation). For this reason, the treatment sets differ in terms of specific tests, while most of the imporant small and large models are the same. We believe that this is acceptable for an illustrative experiment. 
+We use these test cases as experimental treatments, although they involve test dependencies such as JUnit. While some of the test cases rely on programmed models (in terms of the IVML object model), most of the test cases specify the model in terms of IVML, i.e., they provide a more realistic setup, but require the IVML parser as well as dependent Eclipse libraries for execution. As we focus on the reasoning time, the actual creation of the reasoning model shall not affect the results. 
+
+It is important to note that the code for reasoner v1.1.0 does not include several test cases that have been created for v1.3.0. For this experiment, we enable for v1.1.0 as many test cases as possible, i.e., we manually patched back\footnote{\label{fnPatch}Patch is available from \MISSING{XXX}.} the v1.3.0 test cases into v1.1.0 and, if required, either adjusted the expected test results accordingly or, in extreme cases, disabled test cases that cannot be handled by the v1.1.0 reasoner (or the underlying IVML implementation). For this reason, the treatment sets differ in terms of specific tests, while most of the important small and large models are the same. We believe that this is acceptable for an illustrative experiment. 
 
 \subsubsection{Data Collection}\label{sectEvalSetupDataCollection}
 
-In the test cases mentioned above, we employ a generic measurement data collector, which can be feeded with key-value pairs representing measured (real) values. Collected values are stored when the data collection for a test case is finished. By default, the collector can automatically account for (wall) response time, which can help validating more detailed time measures collected during the execution of a treatment. 
-
-For the measurements in this evaluation, we include the default statistics collected by the SSE reasoner, i.e., translation time, evaluation time, number of failed constraints, number of re-evaluated constraints, model statistics, and complexity measures (cf. Section \ref{sectModelComplexity}) delivered by EASy-Producer. However, EASy-Producer release v1.1.0 did not contain the generic data collector and several measures, so, along with the test cases, we patched the related code from v1.3.0 back\footref{fnPatch} into v1.1.0. 
+In the test cases mentioned above, we employ the generic measurement data collector of EASy-Producer. The collector stores key-value pairs representing measured (real) values. A measurement carries additional information to identify the test case, including a tag (used to indicate the reasoning mode), the name of the involved IVML model, the name of the executing test case as well as the intra-experiment repetition count, i.e., the repetition of the same reasoning run within the test case. By default, the collector automatically accounts for (wall-clock) response time, which can help validate more detailed time measures collected during the execution of a treatment. Collected values are stored along with the additional information when the data collection for a test case is finished. As default storage format we use tab-separated values, i.e., a column-table format, as this can easily be read by statistics frameworks such as R as well as by Microsoft Excel. We use Excel in particular for gaining a quick overview of the collected information.
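The tab-separated storage format described above can be loaded directly in R. The following sketch is purely illustrative: the file name and the column names (`model`, `evalTime`) are assumptions and not taken from the actual collector output.

```r
# Hypothetical example: load one tab-separated measurement file
# (file name and column names are assumptions)
data <- read.delim("benchmark-results/measures.tsv", sep = "\t",
                   header = TRUE, stringsAsFactors = FALSE)
# e.g., average evaluation time per model, assuming these columns exist:
# aggregate(evalTime ~ model, data = data, FUN = mean)
```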
+
+As specific measurements for this evaluation, we include 
+\begin{itemize}
+ \item the default statistics of the SSE reasoner, i.e., the time consumed by constraint translation (translation time), the time used for evaluating constraints (evaluation time), the time used for reasoning (including translation and constraint evaluation time), the number of failed constraints, and the number of re-evaluated constraints.
+ \item statistics about the underlying model as well as the complexity measures (cf. Section \ref{sectModelComplexity}) realized by EASy-Producer.
+\end{itemize}
+
+However, EASy-Producer release v1.1.0 does not contain the generic data collector and several measures, so, along with the test cases, we patched the related code from v1.3.0 back\footref{fnPatch} into v1.1.0. 
 
 \subsubsection{Experimental Procedure}\label{sectEvalSetupProcedure}
 
-We run each of the four test suites mentioned above 5 times to collect the measurements. Due to their use in continuous integration, each test suite is prepared to run in an own JVM instance. To compensate delayed JIT optimization, we include a ramp-up run that warms up the JVM. For most test cases, reasoning over a simple representative model including a compound type, a collection over that type and a quantor constraint over the container variable is sufficient. However, for the QualiMaster models, we added as ramp-up run a full run of one of the largest models without accounting for reasoning time or without performing code instantiation. We execute all tests suites in terms of an ANT script (based on existing mechanisms of the continuous integration) in the respective EASy-Producer standalone variant.  We execute the script on 
+During the experiment, we execute each of the test suites in Section \ref{sectEvalSetupTreatments}. Due to their use in continuous integration, each test suite runs in its own JVM instance. To compensate for delayed just-in-time (JIT) compilation, we include a specific ramp-up test in the experimental runs to warm up the JVM. For most test cases, reasoning over a simple representative model including a compound type, a collection over that type and a quantor constraint over the container variable seems to be sufficient. For the QualiMaster models, we added as ramp-up a full run of one of the largest models without accounting for reasoning time. If the test cases include artifact instantiation through VIL, we disable the instantiation phase.
+
+However, pilot experiments showed that significant differences between the first runs of a test case and subsequent runs may still occur. Thus, within each suite, we repeat the reasoning functionality of each test case 10 times on a fresh configuration. These 10 repetitions make up the intra-experiment repetitions. In particular, the repetitions allow for (later) excluding warm-up runs as well as for basic descriptive statistics such as confidence intervals \MISSING{rigorous}. 
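As a sketch of how the intra-experiment repetitions can be exploited, the following base-R function (illustrative only, not part of the evaluation scripts; the number of excluded warm-up repetitions is an assumption) drops warm-up runs and computes a 95\% confidence interval over the remainder:

```r
# Illustrative helper: exclude warm-up repetitions, then compute the
# mean and a 95% confidence interval based on the t-distribution
ci95 <- function(x, warmup = 2) {
  x <- x[-seq_len(warmup)]   # drop the first 'warmup' repetitions
  m <- mean(x)
  half <- qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
  c(mean = m, lower = m - half, upper = m + half)
}
```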
+
+On a given machine/device, we first perform an initial run and then the experimental runs. During the initial run, we execute the test suites on the target device to validate that all tests pass successfully. The measurements of this run are stored separately. Then, for taking the experimental measurements, we repeat the execution of the test suites 5 times with a 5 second pause between two subsequent runs.
+
+We execute all test suites by means of an ANT script (based on existing mechanisms of the continuous integration) in the respective EASy-Producer standalone variant. We execute the script on 
 \begin{itemize}
   \item a typical office computer, a Dell Latitude 7490 laptop with an Intel Core i7 vPro (8th Gen), 32 GBytes of RAM, Windows 10 and Oracle JDK 9 (64 bit). 
Index: /reasoner/measures/script.r
===================================================================
--- /reasoner/measures/script.r	(revision 245)
+++ /reasoner/measures/script.r	(revision 246)
@@ -22,5 +22,4 @@
 names(replaceTags) <- c("orig", "subst")
 
-#http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/
 
 # na-tolerant length function
@@ -32,4 +31,5 @@
 }
 
+#http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/
 # statistical summary over x, returns mean, median, standard deviation, standard error and confidence interval  
 my.summary = function(x, conf.interval=.95) {
@@ -167,6 +167,16 @@
 # composes a diagram file name from a directory, a name and an indicator (name)
 my.composeFileName = function(dir, name, indicator) {
-  name <- paste(name, "_", indicator, ".pdf", sep="")
+  prefix <- my.lastPathSegment(dir)
+  if (nchar(prefix) > 0) {
+    prefix <- paste(prefix, "_", sep="")
+  }
+  name <- paste(prefix, name, "_", indicator, ".pdf", sep="")
   return (paste(dir, name, sep="/"))
+}
+
+#returns the last path segment of dir or an empty string
+my.lastPathSegment = function(dir) {
+  tmp <- str_match(dir, '.*/([^/]+)/?')
+  return (ifelse(is.na(tmp[1,2]), "", tmp[1,2]))
 }
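For illustration, `my.lastPathSegment` behaves as follows (assuming the `stringr` package, which provides `str_match` used elsewhere in the script, is loaded):

```r
library(stringr)                     # provides str_match
my.lastPathSegment("results/run1/")  # "run1" (trailing slash tolerated)
my.lastPathSegment("results/run1")   # "run1"
my.lastPathSegment("noslash")        # ""    (no path separator)
```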
 
@@ -211,6 +221,8 @@
   if (!is.na(str_match(d, "benchmark-results.*"))) {
     print(paste("processing folder ", d))
+    print(" loading data (1)")
     data <- my.readData(d, 1)
     my.createDiagrams(data, d, 1)
+    print(" loading data (all)")
     data <- my.readData(d)
     my.createDiagrams(data, d, "all")
Index: /reasoner/performance.tex
===================================================================
--- /reasoner/performance.tex	(revision 245)
+++ /reasoner/performance.tex	(revision 246)
@@ -1,7 +1,7 @@
 \section{Performance Considerations}\label{sectPerformance}
 
-In this section, we discuss some observations made during the revision of the reasoner and the preparation of this report. We document these observations here as a justification for implementation decisions, but also to avoid that future work accidentally violates undocumented performance improvements. Baseline for this discussion is the version from January 2018. Although some considerations may appear rather specific, they typically stem from more general Java performance optimization knowledge typically hidden by the context of the reasoner, e.g., in constraints or model data structures. 
+In this section, we discuss some observations made during the revision of the reasoner and the preparation of this report. We document these observations here as a justification for implementation decisions, but also to avoid that future work accidentally violates undocumented performance improvements. Although some considerations may appear rather specific, they stem from Java performance optimization knowledge. 
 
-The performance measures reported in this section are just illustrative, i.e., we did not perform systematic measurements to isolate the individual effects while revising the reasoner.
+The performance measures reported in this section are just illustrative, i.e., we did not perform systematic measurements to isolate the individual effects while revising the reasoner. In particular, we observed the full reasoning time, i.e., the response time of calling a propagation operation for specific (challenging) test cases executed in the Eclipse development environment. In this context, the initial reasoning time achieved by the reasoner version from January 2018 for the QualiMaster model (cf. Section \ref{sectEvaluationSetup}) was around 3200 ms and dropped to around 500 ms due to the optimizations.
 
 \begin{itemize}
