What is a Market Basket?

In economics, a market basket is a fixed collection of items that consumers buy.  It is used to compute metrics such as the CPI (a measure of inflation).  In marketing, a market basket is any two or more items bought together.

Market basket analysis is used, especially in retail / CPG, to bundle products, offer promotions and gain insight into shopping / purchasing patterns.  The term “market basket analysis” does not, by itself, describe HOW the analysis is done.  That is, no particular technique is implied by those words.

How is it usually done?

There are three general uses of data: descriptive, predictive and prescriptive.  Descriptive is about the past.  Predictive uses statistical analysis to calculate the change in an output variable (e.g., sales) given a change in an input variable (say, price).  Prescriptive is a system that tries to optimize some metric (typically profit).  Descriptive data (means, frequencies, KPIs, etc.) is a necessary but usually not a sufficient step.  Always get to at least the predictive step as soon as possible.  Note that predictive here does not necessarily mean forecasted into the future.  Structural analysis uses models to simulate the market and to estimate (predict) what causes what to happen.  For example, using regression: given a change in price, what is the estimated (predicted) change in sales?

Market basket analysis often uses descriptive techniques.  Sometimes it is just a “report” of what percent of items are purchased together.  Affinity analysis (a step above) is mathematical, not statistical.  It simply calculates the percent of the time that combinations of products are purchased together.  No probability model is involved.  It is concerned with the rate at which products are purchased together, and not with a distribution around that association.  It is very common and very useful but NOT predictive, and therefore NOT as actionable.
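As a rough illustration of the affinity calculation described above, the sketch below counts how often each pair of products appears in the same basket.  The baskets and the function name pair_rates are made up for illustration; they are not from any real data set.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: each set is one transaction (illustrative data only).
baskets = [
    {"x", "y"},
    {"x", "y", "z"},
    {"y", "z"},
    {"x"},
]

def pair_rates(baskets):
    """Return the share of all baskets that contain each pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

rates = pair_rates(baskets)
for pair, rate in sorted(rates.items()):
    print(pair, f"{rate:.0%}")
```

Note that the output is only a set of rates.  There is no test of significance and no estimate of how one purchase changes the likelihood of another, which is the limitation the text points to.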

Logistic Regression

Let’s talk about logistic regression.  This is a long-established and well-known statistical technique, probably the analytic pillar upon which database marketing has been built.  It is similar to ordinary regression in that there is a dependent variable that depends on one or more independent variables.  Each independent variable has a coefficient (although its interpretation is not the same) and a (type of) t-test for significance.

The differences are that the dependent variable is binary in logistic regression and continuous in ordinary regression, and that interpreting the coefficients requires exponentiation.  Because the dependent variable is binary, the errors are heteroskedastic.  There is no (real) R², and “fit” is about classification.
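To make the exponentiation step concrete, here is a minimal sketch.  The coefficient value 0.25 is hypothetical, and the helper names are mine.  Exponentiating a logistic coefficient gives an odds ratio, so the percentage it yields is a change in the odds of the outcome.

```python
import math

def odds_ratio(beta):
    """Exponentiate a logistic regression coefficient to get an odds ratio."""
    return math.exp(beta)

def percent_change_in_odds(beta):
    """Express the odds ratio as a percent increase or decrease in the odds."""
    return (math.exp(beta) - 1.0) * 100.0

beta = 0.25  # hypothetical coefficient, for illustration only
print(round(odds_ratio(beta), 3))              # about 1.284
print(round(percent_change_in_odds(beta), 1))  # about 28.4 (percent)
```

A coefficient of zero gives an odds ratio of 1 (no effect), a positive coefficient gives a ratio above 1, and a negative coefficient gives a ratio below 1.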

How to Estimate / Predict the Market Basket

The use of logistic regression in terms of market basket becomes obvious when it is understood that the predicted dependent variable is a probability.  The formula to estimate probability from logistic regression is:

P(i) = 1 / (1 + e^(–Z))

where Z = α + βX_i.  This means that the products purchased in a market basket can serve as the independent variables to predict the likelihood of purchasing another product, the dependent variable.  Specifically, take each (major) category of product (with the focus driven by strategy) and run a separate model for each, putting in all other significant products as independent variables.  For example, say we have only three products, x, y and z.  The idea is to design three models and test the significance of each.  That is, using logistic regression:

x = f(y,z)

y = f(x,z)

z = f(x,y).

Of course, other variables can go into the model as appropriate, but the interest is whether or not the independent (product) variables are significant in predicting the probability of purchasing the dependent product.  Once significance is established, the insights come from the sign of each independent variable, i.e., does the independent product increase or decrease the probability of purchasing the dependent product?
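The three-model idea can be sketched in code.  What follows is a minimal, self-contained illustration and not a production method: the purchase counts are invented, the function names are hypothetical, and the fitting routine (Newton-Raphson, with a z-statistic computed as coefficient divided by standard error) is one standard way to estimate a logistic regression.  In practice a statistics package would do this step.

```python
import math

def logistic(z):
    """P = 1 / (1 + e^(-Z))."""
    return 1.0 / (1.0 + math.exp(-z))

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_logit(X, y, iterations=25):
    """Fit by Newton-Raphson; return (coefficients, standard errors)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        p = [logistic(sum(b * v for b, v in zip(beta, row))) for row in X]
        gradient = [sum((yi - pi) * row[j] for row, yi, pi in zip(X, y, p))
                    for j in range(k)]
        info = [[sum(pi * (1 - pi) * row[a] * row[b] for row, pi in zip(X, p))
                 for b in range(k)] for a in range(k)]
        cov = invert(info)  # inverse information matrix
        beta = [beta[j] + sum(cov[j][m] * gradient[m] for m in range(k))
                for j in range(k)]
    std_err = [math.sqrt(cov[j][j]) for j in range(k)]
    return beta, std_err

def basket_models(rows, products):
    """One model per product: that product as a function of all the others."""
    results = {}
    for target in products:
        others = [p for p in products if p != target]
        X = [[1.0] + [float(r[p]) for p in others] for r in rows]  # 1.0 = intercept
        y = [float(r[target]) for r in rows]
        beta, se = fit_logit(X, y)
        # Store (coefficient, z-statistic) for each independent product.
        results[target] = {p: (b, b / s)
                           for p, b, s in zip(others, beta[1:], se[1:])}
    return results

# Invented data: number of customers with each (x, y, z) purchase pattern.
cell_counts = {(1, 1, 0): 30, (1, 0, 0): 10, (0, 1, 0): 10, (0, 0, 0): 30,
               (1, 1, 1): 20, (1, 0, 1): 10, (0, 1, 1): 10, (0, 0, 1): 20}
rows = [dict(zip("xyz", cell)) for cell, n in cell_counts.items() for _ in range(n)]

models = basket_models(rows, ["x", "y", "z"])
for target, effects in models.items():
    for product, (coef, z_stat) in effects.items():
        print(f"{target} = f({product}): coef={coef:+.3f}, z={z_stat:+.2f}")
```

In this made-up data, y has a positive and significant coefficient in the model for x, while z has essentially no effect.  That is the kind of reading the text describes: first check significance, then look at the sign.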

An Example

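As a simple example, say we are analyzing a retail store, with categories of products like consumer electronics, women’s accessories, newborn and infant items, etc.  Thus, using logistic regression, a series of models should be run, one for each category, with the purchase of that category as the dependent variable and the purchase of each of the other categories as independent variables.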


This means the independent variables are binary, coded as a “1” if the customer bought that category and a “0” if not.  The table below details the output for all of the models.  Note that other independent variables can be included in the model, if significant.  These would often be seasonality, consumer confidence, promotions sent, etc.

To interpret, look at, say, the home décor model.  If a customer bought consumer electronics, that increases the probability of buying home décor by 29%.  If a customer bought newborn / infant items, that decreases the probability of buying home décor by 37%.  If a customer bought furniture, that increases the probability of buying home décor by 121%.  In the table, each row is one model (the dependent category) and each column is a purchased (independent) category, in the same order as the rows.  “Insig” marks an insignificant variable and “XXX” marks the model’s own category.

DEPENDENT (MODEL)      ELECTRONICS  WOMEN’S ACC.  NEWBORN  JEWELRY  FURNITURE  HOME DÉCOR  ENTERTAIN
CONSUMER ELECTRONICS   XXX          Insig         Insig    -23%     34%        26%         98%
WOMEN’S ACCESSORIES    Insig        XXX           39%      68%      22%        21%         Insig
NEWBORN, INFANT, ETC.  Insig        43%           XXX      -11%     -21%       -31%        29%
JEWELRY, WATCHES       -29%         71%           -22%     XXX      12%        24%         -11%
FURNITURE              31%          18%           -17%     9%       XXX        115%        37%
HOME DÉCOR             29%          24%           -37%     21%      121%       XXX         31%
ENTERTAIN              85%          Insig         31%      -9%      41%        29%         XXX

This has implications especially for bundling and messaging.  That is, offering, say, home décor and furniture together makes great sense, but offering home décor and newborn / infant items does not.
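The bundling logic can be sketched using the figures from the table.  This is only an illustration under stated assumptions: the columns are taken to follow the same category order as the rows, the identifier names are mine, and the ranking rule (keep pairs where the effect is positive in both directions, then sort by the weaker of the two effects) is one simple choice, not a method given in this article.

```python
# Category order assumed to be the same for rows (models) and columns (purchases).
categories = ["consumer_electronics", "womens_accessories", "newborn_infant",
              "jewelry_watches", "furniture", "home_decor", "entertain"]

N = None  # stands for "Insig" or the XXX diagonal in the table

# effects[i][j]: percent effect of having bought category j in the model for category i.
effects = [
    [N,   N,   N,   -23, 34,  26,  98],
    [N,   N,   39,  68,  22,  21,  N],
    [N,   43,  N,   -11, -21, -31, 29],
    [-29, 71,  -22, N,   12,  24,  -11],
    [31,  18,  -17, 9,   N,   115, 37],
    [29,  24,  -37, 21,  121, N,   31],
    [85,  N,   31,  -9,  41,  29,  N],
]

def bundle_candidates(categories, effects):
    """Pairs whose effect is significant and positive in both directions."""
    pairs = []
    for i in range(len(categories)):
        for j in range(i + 1, len(categories)):
            a, b = effects[i][j], effects[j][i]
            if a is not None and b is not None and a > 0 and b > 0:
                # Rank each pair by the weaker of its two effects.
                pairs.append((min(a, b), categories[i], categories[j]))
    return sorted(pairs, reverse=True)

candidates = bundle_candidates(categories, effects)
for strength, first, second in candidates[:3]:
    print(first, "+", second, f"(weaker effect: {strength}%)")
```

Under these assumptions the strongest candidate is furniture with home décor, and home décor with newborn / infant items does not appear at all, which matches the reading given above.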


The above detailed a simple (and more powerful) way to do market basket analysis.  If given a choice, always go beyond mere descriptive techniques and apply predictive techniques.

See my MARKETING ANALYTICS for additional details: