# Who invented the binary tree algorithm?

## Who invented the decision tree?

I'm trying to keep track of who invented the data structure and decision tree algorithm.

The Wikipedia entry on decision tree learning states that "ID3 and CART were invented independently around the same time (between 1970 and 1980)". ID3 was later featured in:

• Quinlan, JR 1986. Induction of Decision Trees. Do. Learn. 1, 1 (March 1986), 81-106

So I'm not sure the claim is true.

I found a reference to a statistical decision series from 1959 and a collection of working papers from 1958 using google books. The context is not clear and they don't seem to present an algorithm. However, they do not define the data structure and treat it as known.

Using Google Scholar, I found quotes from 1853, but they were analytical errors and not real quotes from that date.

Good question. @ G5W is on the right track to refer to Wei-Yin Loh's article. Loh's article discusses the statistical Prehistory of decision trees and correctly traces their location back to Fisher's (1936) article on discriminant analysis - essentially regression that classifies multiple groups as dependent variables - and from there via AID, THAID, CHAID, and CART models.

The short answer is that the first article I could find developing a "decision tree" approach was in 1959, and British researcher William Belson in an article entitled " Matching and Prediction on the Principle of Biological Classification " ( JRSS) , Series C, Applied Statistics, Vol. 8, No. 2, June 1959, pp. 65-75), the summary of which his approach as one of the matching Describes samples of populations and develops criteria for them:

In this article, Dr. Belson developed a technique for matching population samples. This depends on the combination of empirically developed predictors to obtain the best predictive or matching composition available. The underlying principle differs significantly from that of the multiple correlation method.

The "long" answer is that other, even earlier streams of thought appear relevant here. For example, the simple age-gender cohort breakouts used in actuarial life tables provide a framework for reflection on decisions that go back several centuries. One could also argue that efforts dating back to the Babylonians used quadratic equations that were nonlinear in the variables (not in the parameters http://www-history.mcs.st-and.ac.uk/HistTopics/Quadratic_etc_equations ). html) are relevant, at least insofar as they provide parametric models of logistic growth (I recognize that this is a Route isComment, please read on for a more complete rationale. In addition, philosophers have long recognized and theorized the existence of hierarchically ordered, qualitative information, e.g. B. Aristotle's book on Categories . The concept and adoption of a hierarchy are crucial here. Other relevant, much later discoveries concerned the crossing of the boundaries of Euclidean 3-D space in David Hilbert's development of infinite, HilbertSpace, combinatorics, discoveries in physics related to 4-D Minkowski space, distance and time, the statistical mechanics behind Einstein's theory of special relativity, as well as innovations in probability theory regarding models of Markov chains, transitions and processes. The point here is that there can be a significant delay between any theory and its application - in this case, the delay between theories of qualitative information and developments in terms of their empirical evaluation, prediction, classification, and modeling.

A good guess is that these developments can be linked to the history of the increasing refinement of statisticians, mostly in the 20th century, by developing models other than continuous scale types (e.g. nominal or simply categorical information). and use counting data models (poisson), cross-classification contingency tables, nonparametric nonparametric statistics, multidimensional scaling (e.g. JG Carroll), models with qualitative dependent variables such as B. Two-group logistic regression and correspondence analysis (mainly in Holland and France in the 1970s and 1980s).

There is a large body of literature in which two group logistic regressions are discussed and compared with two group discriminatory analyzes and which provide equivalent solutions for completely nominal characteristics (e.g. Dillon and Goldsteins Multivariate Analysis , 1984).

JS Cramer's article on the history of logistic regression ( The History of Logistic Regression , http://papers.tinbergen.nl/02119.pdf) describes it as the origin of the development of the univariate, logistic function or the classical S-shaped Curve :

Survival of the term logistics and the wide application of the device were largely determined by the personal histories and individual actions of some scholars ...

Deterministic models of the logistics curve emerged in 1825 when Benjamin Gompertz (https://en.wikipedia.org/wiki/Benjamin_Gompertz) published a paper in which the first truly non-linear logistic model (not linear in parameters and not only in variables such as with) the Babylonians) - the Gompertz model and the curve.

I would suggest that another important link in this chain that led to the invention of decision trees was the work of Colombia sociologist Paul Lazarsfeld on latent structural models. His work began in the 1930s, advocating for the emerging OSS (later the CIA, as in John Naisbett's book) with his content analysis of German newspapers during World War II Megatrends discussed) and was finally published in 1950. Andersen describes it like this ( Latent Structure Analysis: An Overview , Erling B. Andersen, Scandinavian Journal of Statistics , Vol. 9, No. 1, 1982, pp. 1-12):

The basis for the classical theory of latent structural analysis was developed in 1950 by Paul Lazarsfeld in a study of the ethnocentrism of American soldiers during World War II. Lazarsfeld was primarily interested in developing the conceptual basis for latent structural models ... However, the statistical methods developed by Lazarsfeld were rather primitive ... An early attempt to derive efficient estimation methods and test procedures was made by Lazarsfeld's colleagues at Columbia University undertaken, TW Anderson, who in an essay ( Psychometrics , March 1954, Volume 19, Issue 1, pp. 1–10, For estimating parameters in latent structure analysis), developed an efficient estimation method for the parameters of the latent class model ... To introduce the framework (of the latent class models) we will briefly outline the basic concepts ... and use a notation system developed much later by Goodman (1974a) ... the data will in form of a Multiple contingency table specified ...

It's worth making a useful distinction here, as this is related to the transition from AID to CHAID (later CART), between contingency table-based models (all variables in the model are nominally scaled) and newer models of latent classes (more) precise, finite mixture models based on "mixtures" of scales and distributions, e.g. B. Kamakura and Russell, 1989, A probabilistic selection model for market segmentation and elasticity structure) how they generate the residuals of the model. In the older contingency table models, the cell numbers contained in the fully cross-classified table formed the basis for the "replications" and therefore the heterogeneity of the model residues used in the subdivision into classes. On the other hand, the newer mixture models rely on repeated measurements across a single subject as the basis for dividing the heterogeneity into the residuals. That answer is NotPropose a direct link between latent class models and decision trees. The relevance for AID and CHAID can be summarized in the statistics used to evaluate the models. AID uses a continuous F distribution, while CHAID uses the chi-square distribution, which is appropriate for categorical information. In my opinion, LCMs form an important piece of the puzzle or narrative that leads to the development of decision trees in the analysis and modeling of contingency tables, along with the many other innovations already mentioned.

CHAID was a later development first suggested in a 1980 dissertation by South African Gordon Kass, as described in this wiki article on CHAID (https://en.wikipedia.org/wiki/CHAID). Of course, CART came along a few years later in the 1980s with Breiman et al. Out, the now famous book Classification and Regression Trees .

AID, CHAID and CART all use tree-like, hierarchically arranged structures as an optimal representation of reality. They just go about it with different algorithms and methods. For me, the next steps in this advancing innovation chain are the emergence of heterarchical structural theories. As defined in this wiki article, heterarchies are "an organizational system in which the elements of the organization are not ranked (non-hierarchical) or have the potential to be ranked in various ways" (https: // en.wikipedia) .org / wiki / heterarchy or for a deeper, more philosophical perspective of heterarchy see Kontopoulos, The Logics of Social Structure). From an empirical point of view, the analysis and modeling of network structures are most representative of this historical development in the understanding of structures (e.g. Freeman's book The Development of Social Network Analysis ). While many network analysts will attempt to impose a hierarchical arrangement on the resulting network, this is more an expression of ingrained and unconscious assumptions than a statement about the empirical reality of the multiplexed network structure in a complex world.

This answer suggests that the evolutionary arc that led to the development of decision trees raises new questions or dissatisfaction with existing "state-of-the-art" methods and requires new solutions and models at each step or phase of the process. In this case, dissatisfaction can be seen in the limitations of modeling two groups (logistic regression) and the recognition of the need to extend this framework to more than two groups. Dissatisfaction with non-representative assumptions of an underlying normal distribution (discriminant analysis or AID) as well as comparison with the relative "freedom" when using non-parametric, non-distribution assumptions and models (e.g. CHAID and CART).

As mentioned earlier, the origin of decision trees almost certainly has a long history that goes back centuries and is geographically dispersed. Multiple currents in history, science, philosophy, and human thought can be traced to outline the narrative that gave rise to the many types of decision trees that exist today. I will be the first to recognize the significant limitations of my brief sketch of this story.

/ ** Supplements ** /

1. This article in New Scientist 2014 bears the title Why do we love to organize knowledge in trees? (https://www.newscientist.com/article/mg22229630-800-why-do-we-love-to-organise-knowledge-into-trees/). It is an overview of the book The Book of by data visualization guru Manuel Lima Trees, which trace the millennia-old use of trees as a visualization and memory aid for knowledge. There seems to be no question that the secular and empirical models and graphs inherent in methods such as AID, CHAID, and CART represent the evolution of this originally religious tradition of classification.

2. In this video (posted online by Salford Systems, implementer of CART software), A tribute to Leo Breiman , Breiman talks about the development of his thinking that led to the CART methodology. It all started with a wall clad in the silhouettes of various WWII battleships.

1. In the introduction to Denis Konigs Theory of finite and infinite graphs of 1936, widely regarded as the first rigorous mathematical foundation for a field previously considered a source of entertainment and riddles for children, Tutte notes in this chapter (p. 13) 4 (from p. 62) of Konig's book is dedicated to trees in graph theory. Tutte's explanation for Konig's definition of a tree is: "Where an 'acyclic' graph is a graph without a circle, a tree is a finite connected acyclic graph ... in other words, in a tree there is one and only one path from one" To me (and I am neither a graph theorist nor a mathematician) this suggests that graph theory and its precursors in Poincares Analysis situs or Veblen's lectures on combinatorial topology may have provided the first intellectual and mathematical foundations for what later became a subject for statisticians.

2. The first Tree of knowledge is largely attributed to the Neoplatonic philosopher Porphyry, who around 270 AD Introduction to Logic , in which knowledge was described and organized using a metaphorical tree ... http://www.historyofinformation.com/expanded.php? id = 3857

3. Just got an even earlier reference to one Tree of knowledge Discovered in the Book of Genesis in the Bible, which is covered in this Wiki article ... https://en.wikipedia.org/wiki/Tree_of_life_(biblical). Genesis probably dates from 1,400 BC. Based on this reference ... https://www.biblica.com/bible/bible-faqs/when-was-the-bible-written/ Regardless, the Book of Genesis came porphyry many centuries before.

The big reference on CART is:

Classification and regression trees
Leo Breiman, Jerome Friedman, Charles J. Stone, RA Olshen (1984)

but that was certainly not the earliest work on the subject.

In his 1986 essay Induction of Decision Trees, Quinlan himself identified Hunt's Concept Learning System (CLS) as the forerunner of ID3. It dates CLS to around 1963 but refers to references

EB Hunt, J. Marin, PJ Stone,
Experiments in Induction