Modeling sampling variability using a nested experimental design

A model for airflow quantity using multiple measurement tools and diagnostic procedures

This is a post about working with engineering data. GitHub here.

The more focused we are on a task, the more likely we are to think that our sample data is all there is. We may vaguely define our population and its parameters. But after a moment of pondering, we usually get back to work.

There are different ways of thinking about data. Usually, we think of nuts-and-bolts machine data as sample data.

This is pragmatic, but it limits how accurate we can be. The example here is about HVAC machines, which move air. We can’t see air, but we can measure its flow.

Thinking like a Bayesian, I’m going to write up my model starting with the parameter I want to know, total_system_airflow. I’ll assume that it’s normally distributed due to being a function of many physical processes and laws.
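
To make the normality assumption a bit more concrete, here is a small Python sketch (standard library only) that adds up many small independent effects and checks how normal the result looks. The number of effects, their uniform range, and the function name are all made up for illustration; none of it comes from my data.

```python
import random
import statistics

def summed_effects(n_effects, rng):
    # One hypothetical airflow deviation: the sum of many small independent pushes
    return sum(rng.uniform(-1.0, 1.0) for _ in range(n_effects))

rng = random.Random(42)
draws = [summed_effects(50, rng) for _ in range(2000)]

center = statistics.mean(draws)
spread = statistics.stdev(draws)

# For a normal distribution, roughly 68% of draws fall within one sd of the mean
within_one_sd = sum(abs(d - center) <= spread for d in draws) / len(draws)
print(f"share within one sd: {within_one_sd:.2f}")
```

None of the individual effects is normal, but their sum behaves like one, which is the intuition behind the assumption.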

The environment, whether you’re in a lab or uncontrolled setting, matters. The diagnostic tools have measurement error.

OK, I’ll back up. Anyway, here’s a workflow for figuring it all out.

We can’t measure total_system_airflow without introducing variance. There are many sources of variance, including the act of measurement itself.
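
A rough way to picture measurement variance is to simulate repeated readings of one fixed airflow. This is only a sketch: the true airflow of 700 and the measurement sd of 25 are hypothetical numbers, and simulate_readings is a helper I made up for the example.

```python
import random
import statistics

def simulate_readings(true_airflow, measurement_sd, n, seed=0):
    """Simulate n noisy readings of one fixed (unobserved) airflow value."""
    rng = random.Random(seed)
    return [rng.gauss(true_airflow, measurement_sd) for _ in range(n)]

# Hypothetical values, chosen only for illustration
readings = simulate_readings(true_airflow=700.0, measurement_sd=25.0, n=200)

sample_mean = statistics.mean(readings)
sample_sd = statistics.stdev(readings)
standard_error = sample_sd / len(readings) ** 0.5

print(f"mean={sample_mean:.1f} sd={sample_sd:.1f} se={standard_error:.2f}")
```

The single readings scatter widely around the true value, while the mean of many readings is much tighter. That gap is the variance the measurement itself adds.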

Measuring different aspects of machine performance introduces system-level variance. The machine’s built-in diagnostic tools can take these measurements, which can then be compared with measurements taken outside the system.

Measuring machine performance at multiple levels (machine i in environment j) adds yet more uncertainty from the environment. Adding a varying-intercept term allows for testing in different environments.
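
Here is a minimal sketch of what a varying intercept for the environment looks like in simulated data. I am assuming every machine is measured several times in every environment, and the machine means, environment offsets, and noise level are hypothetical. simulate_design is a name I invented for this example.

```python
import random
import statistics

def simulate_design(machine_means, env_offsets, noise_sd, n_reps, seed=1):
    """Return (machine, environment, reading) rows for each machine in each environment."""
    rng = random.Random(seed)
    rows = []
    for i, mu in enumerate(machine_means):
        for j, offset in enumerate(env_offsets):
            for _ in range(n_reps):
                # reading = machine mean + environment intercept + measurement noise
                rows.append((i, j, mu + offset + rng.gauss(0.0, noise_sd)))
    return rows

machine_means = [740.0, 600.0, 790.0]   # hypothetical true airflow per machine
env_offsets = [-30.0, 0.0, 10.0, 20.0]  # hypothetical environment shifts (sum to zero)
rows = simulate_design(machine_means, env_offsets, noise_sd=5.0, n_reps=10)

# Recover each environment's intercept as its mean reading minus the grand mean
grand_mean = statistics.mean(r[2] for r in rows)
estimated_offsets = [
    statistics.mean(r[2] for r in rows if r[1] == j) - grand_mean
    for j in range(len(env_offsets))
]
print([round(e, 1) for e in estimated_offsets])
```

Because the design is balanced, the simple averages land close to the offsets that went in. The Stan model does the same job, with partial pooling and uncertainty attached.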

I’m going to code my model up in Stan. A non-centered parameterization usually samples better for hierarchical models like this one.
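
This is not the Stan code, just a Python sketch of the idea behind non-centering. Drawing an offset directly from a normal with sd sys_sig gives the same distribution as drawing a standard normal and scaling it by sys_sig. The value 30 and the function names are assumptions for the example.

```python
import random
import statistics

def centered_draws(sd, n, rng):
    # Centered: draw the environment offset directly on its own scale
    return [rng.gauss(0.0, sd) for _ in range(n)]

def non_centered_draws(sd, n, rng):
    # Non-centered: draw a standard normal z, then scale it by sd
    return [sd * rng.gauss(0.0, 1.0) for _ in range(n)]

sys_sig = 30.0  # hypothetical environment-level standard deviation
centered = centered_draws(sys_sig, 5000, random.Random(7))
non_centered = non_centered_draws(sys_sig, 5000, random.Random(8))

# Both versions should show a mean near 0 and an sd near sys_sig
print(round(statistics.stdev(centered), 1), round(statistics.stdev(non_centered), 1))
```

The two versions describe the same prior. The difference is that the sampler works on z, whose scale does not depend on sys_sig, and that tends to make the geometry easier when the variance is poorly identified.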

sys_sig represents environmental variance. Most of the variance comes from the environment. sigma represents the remaining variation across multiple measurements on the machine itself.

I love Bayesian statistics because it tells me the parameter I want to know. This is the population parameter for the system itself.

            mean se_mean     sd   2.5%    25%    50%    75%   97.5% n_eff Rhat
b0          0.13    0.11   4.87  -9.40  -3.27   0.19   3.52    9.52  2110 1.00
b1[1]     736.85    7.13 182.06 364.96 614.85 741.02 859.51 1094.83   652 1.01
b1[2]     599.41    5.36 145.70 315.17 502.22 599.18 697.40  884.46   740 1.00
b1[3]     789.73    6.20 168.34 464.11 679.87 791.09 904.09 1119.74   737 1.00
g0          0.00    0.02   1.01  -1.99  -0.70   0.00   0.65    1.97  1974 1.00
b_tesp     -0.03    0.02   1.01  -1.95  -0.71  -0.04   0.67    1.97  2045 1.00
b_sys_cfm   0.68    0.01   0.16   0.38   0.58   0.69   0.79    1.00   749 1.00
sys_sig    34.78    4.04  34.66   0.20   1.93  33.00  52.74  123.00    74 1.04
b_dl       -0.02    0.03   0.79  -1.55  -0.56  -0.03   0.51    1.55   748 1.00
sigma      26.68    4.24  38.78   0.12   0.98   4.74  43.99  118.07    84 1.03
lp__      -23.44    0.52   4.71 -33.61 -26.46 -22.74 -19.92  -15.69    81 1.06

The three b1[i] terms, one per system, estimate the true expected output of each machine. The other coefficients capture variance and give ranges for equipment calibration and performance effects.
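
A quick sanity check on the summary is to flag parameters with poor sampling diagnostics. The sketch below uses the n_eff and Rhat columns copied from the table above. The cutoffs (Rhat above 1.01, n_eff below 100) are my assumption of a reasonable rule of thumb, not something the model output prescribes.

```python
# (n_eff, Rhat) pairs copied from the summary table
diagnostics = {
    "b0": (2110, 1.00),
    "b1[1]": (652, 1.01),
    "b1[2]": (740, 1.00),
    "b1[3]": (737, 1.00),
    "g0": (1974, 1.00),
    "b_tesp": (2045, 1.00),
    "b_sys_cfm": (749, 1.00),
    "sys_sig": (74, 1.04),
    "b_dl": (748, 1.00),
    "sigma": (84, 1.03),
    "lp__": (81, 1.06),
}

RHAT_MAX = 1.01   # assumed cutoff for "chains have mixed"
N_EFF_MIN = 100   # assumed cutoff for "enough effective draws"

flagged = [
    name
    for name, (n_eff, rhat) in diagnostics.items()
    if rhat > RHAT_MAX or n_eff < N_EFF_MIN
]
print(flagged)  # ['sys_sig', 'sigma', 'lp__']
```

With those cutoffs, the two variance terms and lp__ get flagged, so I would read the sys_sig and sigma estimates with some caution and probably run longer chains before leaning on them.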