Modeling the marketplace using the RStan ecosystem.
Author’s note: The data in this post is 100% simulated; see the code here.
Customers come to us from different places, at different times, for different reasons. We do not control the market; however, it is incumbent on us to know our market to the fullest extent possible. This means knowing our customers, our relationship with them over time, and our broader reach in the community. The purpose of this project is to measure our marketing impact.
The outcome of interest, whether a customer chooses to work with us, exists at the appointment level; that is, did a particular sales call result in a closed deal?
The variables and values included in this example are:
Note: I’ll use the terms “feature,” “variable,” “indicator” and “coefficient” to refer to the above items.
The magnitude of the effect of each feature on the probability of a closed sale can be measured, and the variance partitioned, by using a regression framework.
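For a sense of what that framework looks like before any of the refinements described below, here is a minimal single-level sketch, assuming the close, product, and call_to_estimate variables used later in this post:

# A plain logistic regression: the probability of a closed sale as a
# function of two appointment-level features
baseline_glm <- glm(close ~ product + call_to_estimate,
                    data = dat,
                    family = binomial(link = "logit"))
summary(baseline_glm)  # coefficients are on the log-odds scale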
The problem with regression is that, as more variables are added to a model, fewer observations fall into each particular combination of categories. This data sparsity tends to lead to a problem in regression known as over-fitting.
For example, there may be 50 appointments that resulted from a particular marketing campaign. Within this campaign, however, there may only be 5 appointments attributed to a particular combination of features.
As more variables are added to the model, low sample sizes within subgroups, class imbalances, and an exponentially increasing number of subgroups lead to an increased likelihood that some feature effects are over-estimated.
In order to improve the estimates of the model coefficients, two additional modifications to the traditional logistic regression will be made.
First, the regression will be executed in a Bayesian framework. Weakly informative prior distributions will regularize the estimates to within a reasonable range. Priors prevent over-estimation by assuming that, for example, a marketing campaign would boost sales by 10-20% at most.
Second, where applicable, the variables will have a multi-level structure. A multi-level variable partially pools each group’s estimate toward the overall average. For example, if there were a strong positive response to a particular radio commercial (but only a few observations), the estimate of that response would be pulled back toward the average effect of a radio commercial.
This analysis will be carried out in Stan, an open-source probabilistic programming language that uses Hamiltonian Monte Carlo to draw samples from the joint posterior distribution.
The RStan ecosystem, which can be called from the R computing environment, includes the following packages:
What follows is an overview of the analysis, organized roughly by the packages above. The entire analysis, executed as an R Notebook, is available here.
The full model can be expressed in the following form:
\begin{align}
L_i &\sim \text{Binomial}(1, \text{logit}^{-1}[p_{i\ldots}]) \\
p_{i\ldots} &= \alpha_{i\ldots} + x\beta \\
\alpha_{i\ldots} &\sim N(0, \sigma_{i\ldots}) \\
\beta &\sim \text{Cauchy}(0, 2.5) \\
\sigma_{i\ldots} &\sim \text{Cauchy}(1, 25)
\end{align}
Where i indexes a particular appointment and … represents the customer, lead source, year, city, and sales professional multi-level variables. α represents the varying intercepts for each combination of the multi-level variables, and β represents the appointment-level coefficients for product, n_service, prior_revenue, and call_to_estimate.
As shown in the code below, rstanarm expresses multi-level models in a compact form using lme4 syntax.
library(rstanarm)

mlm_sim <- stan_glmer(
  # Varying intercepts for each multi-level grouping factor, plus
  # appointment-level predictors; "- 1" drops the global intercept
  formula = close ~ (1|customer_id) +
    (1|lead_source) +
    (1|year) +
    (1|city) +
    (1|sales_professional) +
    product +
    n_service +
    prior_revenue +
    call_to_estimate - 1,
  # Weakly informative, regularizing priors; autoscale = FALSE keeps them as specified
  prior = normal(location = 0, scale = 1, autoscale = FALSE),
  prior_intercept = cauchy(location = 0, scale = 2.5, autoscale = FALSE),
  prior_aux = cauchy(location = 0, scale = 25, autoscale = FALSE),
  data = dat,
  family = binomial,
  # Sampler settings: 4 chains run on 3 cores, 1000 iterations each with 150 warmup
  cores = 3,
  chains = 4,
  iter = 1000,
  warmup = 150,
  seed = 1)
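Once sampling finishes, the fit can be inspected directly; a quick sketch:

print(mlm_sim, digits = 2)  # posterior medians and MAD SDs for the coefficients
ranef(mlm_sim)              # point estimates of the varying intercepts for each grouping factor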
Regularizing prior distributions were chosen, and prior predictive distributions were constructed, by taking advantage of the generative nature of a Bayesian model.
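One way to inspect the prior predictive distribution in rstanarm is to refit the model while sampling from the priors only (prior_PD = TRUE). A minimal sketch, assuming the mlm_sim fit above:

# Sample from the priors only, ignoring the likelihood
prior_sim <- update(mlm_sim, prior_PD = TRUE)
# Simulate the outcomes implied by the priors alone
prior_draws <- posterior_predict(prior_sim)   # draws x observations matrix of 0/1 outcomes
hist(rowMeans(prior_draws),
     main = "Prior predictive close rate",
     xlab = "Proportion of appointments closed per draw")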
In order to verify the superior goodness-of-fit of the multi-level model, it was compared to a single-level Bayesian regression with identical priors.
The loo package uses approximate leave-one-out cross-validation to measure a model’s out-of-sample predictive accuracy: for each observation, it estimates how well the model would predict that observation had it been excluded from the fit, and it does this for every observation in the dataset.
Both models were compared on their expected log pointwise predictive density, and the multi-level model performed better.
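A sketch of that comparison, where slm_sim stands in for the single-level model fit with stan_glm() and the same priors:

library(loo)
# Approximate leave-one-out cross-validation for each model
loo_mlm <- loo(mlm_sim)
loo_slm <- loo(slm_sim)
# Rank the models by expected log pointwise predictive density (ELPD)
loo_compare(loo_mlm, loo_slm)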
For a number of reasons, a model can fail to give accurate results even when it runs, so the fit itself needs to be checked. How you program the model is part of the model.
The easiest way to look “under the hood” is ShinyStan, a Shiny application for exploring the output of a Stan model.
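Launching the app on the fitted model takes one line:

library(shinystan)
# Opens an interactive app with sampler diagnostics, posterior summaries, and plots
launch_shinystan(mlm_sim)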
This model passed all diagnostics without any issues and has been shared on my Shiny dashboard.
Bayesplot makes graphical model checking and inference easy by providing a common set of plotting functions for MCMC draws.
It is built on the ggplot2 graphics library, which allows the analyst to fine-tune a plot by adding text, labels, themes, and so on.
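A sketch of what this can look like for the model above; the names passed to pars are assumptions based on the formula and would need to match the parameter names in the fitted object:

library(bayesplot)
library(ggplot2)
color_scheme_set("blue")
# Posterior distributions of the appointment-level coefficients, with 90% intervals shaded
mcmc_areas(as.matrix(mlm_sim),
           pars = c("product", "n_service", "prior_revenue", "call_to_estimate"),
           prob = 0.9) +
  labs(title = "Appointment-level effects",
       subtitle = "Posterior distributions with 90% intervals")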
That’s a brief overview of the project. For more information, including the prior and posterior predictive distributions, see the full R notebook here.