Categorical variable


In statistics, a categorical variable (also called a qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values, assigning each individual or other unit of observation to a particular group or nominal category on the basis of some qualitative property. In computer science and some branches of mathematics, categorical variables are referred to as enumerations or enumerated types. Commonly (though not in this article), each of the possible values of a categorical variable is referred to as a level. The probability distribution associated with a random categorical variable is called a categorical distribution.
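The correspondence with enumerated types can be made concrete with a minimal sketch. The `BloodType` variable and its four levels are a hypothetical example chosen for illustration, not something taken from the article.

```python
from enum import Enum


class BloodType(Enum):
    """A hypothetical categorical variable with a fixed set of four levels."""
    A = "A"
    B = "B"
    AB = "AB"
    O = "O"


# Each unit of observation is assigned to exactly one category.
observations = [BloodType.A, BloodType.O, BloodType.O, BloodType.AB]

# The levels are the possible values, in definition order.
levels = [member.name for member in BloodType]
```

Constructing a value outside the fixed set, such as `BloodType("C")`, raises a `ValueError`, which mirrors the idea that the set of possible values is limited and fixed.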

Categorical data is the statistical data type consisting of categorical variables or of data that has been converted into that form, for example as grouped data. More specifically, categorical data may derive from observations made of qualitative data that are summarised as counts or cross tabulations, or from observations of quantitative data grouped within given intervals. Often, purely categorical data are summarised in the form of a contingency table. However, particularly when considering data analysis, it is common to use the term "categorical data" to apply to data sets that, while containing some categorical variables, may also contain non-categorical variables.
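As a rough illustration of summarising categorical observations as a cross tabulation, the sketch below builds a small contingency table using only the standard library. The two variables (`smoker`, `outcome`) and the six records are invented for the example.

```python
from collections import Counter

# Hypothetical paired observations: (smoker, outcome).
records = [
    ("yes", "ill"), ("yes", "well"), ("no", "well"),
    ("no", "well"), ("yes", "ill"), ("no", "ill"),
]

# Count how often each combination of categories occurs.
counts = Counter(records)

# Row and column labels are the observed levels of each variable.
rows = sorted({smoker for smoker, _ in records})
cols = sorted({outcome for _, outcome in records})

# table[i][j] is the number of records with rows[i] and cols[j].
table = [[counts[(r, c)] for c in cols] for r in rows]
```

Here `rows` is `["no", "yes"]`, `cols` is `["ill", "well"]`, and `table` is `[[1, 2], [2, 1]]`.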

A categorical variable that can take on exactly two values is termed a binary variable or a dichotomous variable; an important special case is the Bernoulli variable. Categorical variables with more than two possible values are called polytomous variables; categorical variables are often assumed to be polytomous unless otherwise specified. Discretization is treating continuous data as if it were categorical. Dichotomization is treating continuous data or polytomous variables as if they were binary variables. Regression analysis often treats category membership with one or more quantitative dummy variables.
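The three operations named above can be sketched as follows. The function names, the age cut points, and the choice of the first level as the reference category for dummy coding are all assumptions made for illustration; other coding schemes exist.

```python
from bisect import bisect_right


def discretize(value, edges, labels):
    """Map a continuous value to the label of the interval it falls in.

    `edges` must be sorted; a value equal to an edge goes to the upper interval.
    """
    return labels[bisect_right(edges, value)]


def dichotomize(value, threshold):
    """Collapse a continuous value into a binary indicator (1 if >= threshold)."""
    return 1 if value >= threshold else 0


def dummy_code(value, levels):
    """Encode a K-level category as K-1 indicators, with levels[0] as reference."""
    return [1 if value == level else 0 for level in levels[1:]]


# Hypothetical cut points: below 18, 18 up to 64, 65 and above.
age_edges = [18, 65]
age_labels = ["minor", "adult", "senior"]

groups = [discretize(age, age_edges, age_labels) for age in (15, 34, 70)]
coded = [dummy_code(group, age_labels) for group in groups]
```

With these inputs `groups` is `["minor", "adult", "senior"]` and `coded` is `[[0, 0], [1, 0], [0, 1]]`; the reference level is represented by all indicators being zero.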

Number of possible values


Categorical random variables are normally described statistically by a categorical distribution, which allows an arbitrary K-way categorical variable to be expressed with separate probabilities specified for each of the K possible outcomes. Such multiple-category categorical variables are often analyzed using a multinomial distribution, which counts the frequency of each possible combination of numbers of occurrences of the various categories. Regression analysis on categorical outcomes is accomplished through multinomial logistic regression, multinomial probit or a related type of discrete choice model.
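The relationship between the two distributions can be sketched in a few lines: a categorical distribution assigns a probability to each of the K outcomes of one draw, while a multinomial distribution assigns a probability to a vector of counts over n independent draws. The function names and the probabilities below are illustrative assumptions.

```python
from math import factorial, prod


def categorical_pmf(k, probs):
    """Probability that a single draw lands in category k (0-based index)."""
    return probs[k]


def multinomial_pmf(counts, probs):
    """Probability of seeing counts[i] occurrences of category i in sum(counts) draws."""
    n = sum(counts)
    # Multinomial coefficient n! / (c_1! * ... * c_K!), computed with exact integers.
    coefficient = factorial(n)
    for c in counts:
        coefficient //= factorial(c)
    return coefficient * prod(p ** c for p, c in zip(probs, counts))


# Hypothetical 3-way variable.
probs = [0.5, 0.3, 0.2]

# Two draws, one in category 0 and one in category 1: 2 * 0.5 * 0.3.
two_draws = multinomial_pmf([1, 1, 0], probs)
```

With a single draw the multinomial reduces to the categorical distribution: `multinomial_pmf([0, 1, 0], probs)` equals `categorical_pmf(1, probs)`.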

Categorical variables that have only two possible outcomes (e.g., "yes" vs. "no" or "success" vs. "failure") are known as binary variables (or Bernoulli variables). Because of their importance, these variables are often considered a separate category, with a separate distribution (the Bernoulli distribution) and separate regression models (logistic regression, probit regression, etc.). As a result, the term "categorical variable" is often reserved for cases with 3 or more outcomes, sometimes termed a multi-way variable in opposition to a binary variable.
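Although binary variables are usually handled separately, the Bernoulli distribution is simply the two-outcome case of the categorical distribution. The sketch below makes that explicit; the function names and the coding of "success" as 1 are assumptions for illustration.

```python
def bernoulli_pmf(x, p):
    """Probability of outcome x (1 = success, 0 = failure) with success probability p."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli variable takes only the values 0 and 1")
    return p if x == 1 else 1.0 - p


def as_two_way_categorical(p):
    """The same distribution written as a 2-way categorical probability vector."""
    return [1.0 - p, p]


success_probability = 0.3  # hypothetical value
two_way = as_two_way_categorical(success_probability)
```

Indexing `two_way` by the outcome gives the same probability as `bernoulli_pmf` for both outcomes.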

It is also possible to consider categorical variables where the number of categories is not fixed in advance. As an example, for a categorical variable describing a particular word, we might not know in advance the size of the vocabulary, and we would like to allow for the possibility of encountering words that we have not already seen. Standard statistical models, such as those involving the categorical distribution and multinomial logistic regression, assume that the number of categories is known in advance, and changing the number of categories on the fly is tricky. In such cases, more advanced techniques must be used. An example is the Dirichlet process, which falls in the realm of nonparametric statistics. In such a case, it is logically assumed that an infinite number of categories exist, but at any one time most of them (in fact, all but a finite number) have never been seen. All formulas are phrased in terms of the number of categories actually seen so far rather than the (infinite) total number of potential categories in existence, and methods are created for incremental updating of statistical distributions, including adding "new" categories.
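One way to see how formulas can be phrased only in terms of categories seen so far is the predictive rule usually associated with the Dirichlet process (the Chinese restaurant process view): after n observations, an already seen category with count n_k has probability n_k / (n + alpha), and a not yet seen category has probability alpha / (n + alpha). The sketch below is a simplified illustration of that rule, not a full Dirichlet process implementation; the concentration parameter `alpha`, the function names, and the three-word sample are assumptions.

```python
from collections import Counter


def predictive_probs(counts, alpha):
    """Return (probabilities of seen categories, probability of a new category)."""
    n = sum(counts.values())
    seen = {category: c / (n + alpha) for category, c in counts.items()}
    new = alpha / (n + alpha)
    return seen, new


def observe(counts, category):
    """Incrementally update the counts; an unseen category is added automatically."""
    counts[category] += 1


# Hypothetical stream of words with an open-ended vocabulary.
word_counts = Counter()
for word in ["the", "cat", "the"]:
    observe(word_counts, word)

seen_probs, new_prob = predictive_probs(word_counts, alpha=1.0)
```

With these three words and `alpha=1.0`, "the" has predictive probability 0.5, "cat" has 0.25, and an unseen word has 0.25. The number of tracked categories grows only when a new word is actually observed.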