More Advanced Regression Models
For many projects, OLS regression will suffice. However, for more advanced work, different types of dependent variables generally call for different types of regression models. This page presents a very rudimentary overview of some other options.
Many outcomes of interest are dichotomous: each case either falls into a category or it doesn't. For these types of dependent variables, logit and probit regression models are a better fit than OLS models. The Stata commands are simple (instead of starting with "reg", you start with "logit" or "probit"), but the interpretation is a bit different.
In the vast majority of cases, logit and probit models give substantively similar results. This tutorial proceeds using logit models, but similar logic would apply if you wanted to use probit instead.
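To see why the two models tend to agree, it helps to compare the curves they are built on. The sketch below is in Python rather than Stata, uses only the standard library, and is meant purely as an illustration. The rescaling constant 1.702 is an assumption brought in from outside this tutorial (a conventional value for lining up the logistic and normal curves), and the function names are made up for this example.

```python
import math


def logistic_cdf(x):
    """Logistic CDF, the curve behind logit models."""
    return 1.0 / (1.0 + math.exp(-x))


def normal_cdf(x):
    """Standard normal CDF, the curve behind probit models."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Assumption: a conventional constant for rescaling the normal curve so it
# lines up with the logistic curve. It is not estimated from any data here.
SCALE = 1.702


def max_gap(lo=-6.0, hi=6.0, steps=1201):
    """Largest absolute gap between the two curves on an evenly spaced grid."""
    gap = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        gap = max(gap, abs(logistic_cdf(x) - normal_cdf(x / SCALE)))
    return gap


if __name__ == "__main__":
    print(f"largest gap between the two curves: {max_gap():.4f}")
```

Once the scales are lined up, the two curves differ by only about one percentage point at most. That is why the raw logit and probit coefficients look different while the predicted probabilities, and the substantive story, usually do not.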
Let's say you're really interested in whether people identify as Democrats or not: that is, you aren't interested in the slight degrees of variation, but only in a dichotomous sense of Democrat/not-Democrat. We can make the party identification variable a Democratic dummy variable and use logistic regression to estimate the effect of the independent variables on this outcome (see the do-file for the generation of the Democratic dummy variable).
Consider the simple hypothesis that women are more likely to be Democrats. This hypothesis can be tested with logistic regression using the following command:
logit democrat female
The output looks like this:
The fact that the coefficient on the independent variable of interest is statistically significant lends some very general credence to our theory, but so far we don't have any real sense of whether the effect is big or small. Looking at marginal effects, rather than the logit coefficients, is one way of making a better assessment of that.
Unlike OLS coefficients, logit and probit coefficients don't have an intrinsic substantive interpretation attached to them. However, you can easily calculate marginal effects for variables of interest using the "mfx" command (in Stata 11 and later, "mfx" has been superseded by the "margins" command, though it still runs). By default, the command calculates the effect holding the other variables at their means. However, you can manually alter this by adding options to the command. In this case, we're just looking at a bivariate model (there are no other variables besides the one we're focusing on), so this is sort of beside the point.
Type the following command after you run your logistic regression:
mfx
The output looks like this:
The number under "dy/dx" is the marginal effect of a one-unit shift in the female variable (which is a dummy variable, so that means going from 0 to 1, which means male to female). It turns out women are about 8 percentage points more likely than men to identify as Democrats. The z-statistic ("z") of 3.94 is greater than 1.96, so this effect is statistically significant at the .05 level. Indeed, as indicated by the 0.000 under "P>|z|", it would be statistically significant even using a more stringent significance level than .05. Our theory is supported by the evidence.
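If you want to see the arithmetic behind a number like this, here is a rough sketch in Python (not Stata). The intercept and coefficient below are made-up values chosen for illustration, not the estimates from the model above, and the function names are hypothetical. The point is only that, for a dummy variable in a bivariate logit, the effect is the difference between two predicted probabilities.

```python
import math


def logistic(x):
    """Convert a logit (log-odds) value into a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def dummy_effect(intercept, coef):
    """Change in predicted probability when a 0/1 variable moves from 0 to 1.

    Assumes a bivariate logit: an intercept plus one dummy variable.
    """
    p_when_0 = logistic(intercept)
    p_when_1 = logistic(intercept + coef)
    return p_when_1 - p_when_0


# Hypothetical logit estimates (NOT taken from the tutorial's output).
HYP_INTERCEPT = -0.75
HYP_FEMALE_COEF = 0.35

if __name__ == "__main__":
    effect = dummy_effect(HYP_INTERCEPT, HYP_FEMALE_COEF)
    print(f"change in Pr(democrat = 1): {effect:.3f}")
```

Notice that the coefficient (0.35 in this made-up case) and the change in probability are different numbers on different scales. That is why the raw logit output can't be read as a substantive effect.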
Some outcomes of interest have more than two categories that clearly share an underlying order, but are not quite a scale. An example might be attitudes toward gay marriage: Some people oppose all legal recognition for gay couples. Others support civil unions for gay couples, but not equal access to the traditional institution of marriage. Finally, some people support full equal marriage rights. These three categories can certainly be ordered. The first is the most restrictive, the second is somewhere in between, and the third is the most permissive. Yet it's not necessarily the case that the space between "no legal recognition" and "civil unions only" is the same as the space between "civil unions only" and "full marriage rights." You certainly could use an OLS regression model to estimate such attitudes, but ordered logit and probit models might be more methodologically proper.
Ordered probit tends to be more commonly used, so we'll use that as our example. But again: the same general logic would apply if you chose to use an ordered logit model instead.
Let's consider the hypothesis that women are more likely to say they support full marriage rights than just civil unions. Let's also control for party identification and education. After recoding the gay marriage variable a bit (see the do-file for full replication code), I run the following command:
oprobit gaymarriage female partyid highschool collegedegree
This is the output:
The ordered probit coefficient for the female variable is statistically significant, lending some credence to our theory. However, the number associated with this variable (.10073) doesn't really mean anything substantive. So far all we really know is the relationship is not zero. To get more at the real nature and extent of the effect, you can calculate changes in probabilities.
You can easily calculate changes in probabilities using the "prchange" command. This command is part of a software package called SPost, which is available from the SPost website. You need to install this package before the "prchange" command will run. You can find various ways of doing that on the website, or you can just type:
net install spost9_ado
This provides a simple installation of a slightly older version, which will likely be fine for your purposes (this method assumes you are connected to the internet). After you run your ordered probit model, simply type this command to get changes in probabilities:
prchange
This gives you the following output:
Focus on the numbers immediately below the word female. The first number tells you the average size of the change in probability associated with this variable, which is about three percentage points. The number under "1" tells you women are about 4 percentage points less likely than men to be in the first category (no legal recognition). The number under "2" is basically zero, which suggests women aren't really more or less likely to support civil unions only. However, the number under "3" tells you women are about 4 percentage points more likely than men to be in the third category (full marriage rights). In other words, the shift is in the liberal direction: probability moves out of the most restrictive category and into the most permissive one. Although 4 percentage points is not a huge change, it is consistent with our hypothesis.
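For a sense of where numbers like these come from, here is a rough Python sketch (not Stata) of how an ordered probit turns a coefficient into a probability for each category. The female coefficient is the one reported above (.10073). The two cutpoints and the contribution of the other variables are made-up values for illustration only, so the printed numbers will not match the real output.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, the curve behind probit models."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def category_probs(xb, cut1, cut2):
    """Ordered probit probabilities for three ordered categories."""
    p1 = normal_cdf(cut1 - xb)        # category 1: below the first cutpoint
    p2 = normal_cdf(cut2 - xb) - p1   # category 2: between the cutpoints
    p3 = 1.0 - normal_cdf(cut2 - xb)  # category 3: above the second cutpoint
    return [p1, p2, p3]


FEMALE_COEF = 0.10073  # coefficient reported in the text above

# Hypothetical values (NOT from the tutorial's output).
HYP_CUT1 = -0.5
HYP_CUT2 = 0.3
HYP_OTHER_XB = 0.0  # combined contribution of the other variables


def changes_for_female():
    """Change in each category's probability when female goes from 0 to 1."""
    men = category_probs(HYP_OTHER_XB, HYP_CUT1, HYP_CUT2)
    women = category_probs(HYP_OTHER_XB + FEMALE_COEF, HYP_CUT1, HYP_CUT2)
    return [w - m for w, m in zip(women, men)]


if __name__ == "__main__":
    changes = changes_for_female()
    avg_abs = sum(abs(c) for c in changes) / len(changes)
    for category, change in enumerate(changes, start=1):
        print(f"category {category}: {change:+.4f}")
    print(f"average absolute change: {avg_abs:.4f}")
```

Because the three probabilities always sum to one, the changes always sum to zero: whatever probability the top category gains has to come out of the other two.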
Another potential outcome of interest is one where there are more than two categories, but no clear underlying order. An example might be a question asking respondents what they think the most important problem facing the country is. Answers might include unemployment, the war in Afghanistan, the legality of abortion, balancing the budget, or any other political issue. What makes this type of variable different is that its multiple categories don't share any underlying order: Unemployment isn't "more" or "less" than the war in Afghanistan, which in turn isn't "more" or "less" than the legality of abortion, and so on. A multinomial logit model would be a better fit for this type of dependent variable, although the assumptions behind multinomial logit models are a bit more involved. The Stata command is "mlogit" and follows the same logic as every other regression command (i.e., if your dependent variable is mostimportantproblem and your independent variables are gender and partyid, your command would be "mlogit mostimportantproblem gender partyid"). However, if you aren't familiar with multinomial logistic regression models, you should be sure to read about how they work before just running the commands.
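As a starting point for that reading, here is a rough Python sketch (not Stata) of the core idea. One category serves as the baseline, every other category gets its own linear index relative to that baseline, and the indexes are converted into probabilities that sum to one. The index values and the function name are made up for illustration and do not come from any estimated model.

```python
import math


def mlogit_probs(linear_indexes):
    """Multinomial logit probabilities.

    linear_indexes holds one value per non-base category. The base category's
    index is fixed at 0, which is what makes it the reference point.
    """
    scores = [0.0] + list(linear_indexes)
    top = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical linear indexes for one respondent (NOT estimated from data):
# two non-base answers to the "most important problem" question.
HYP_INDEXES = [0.4, -0.2]

if __name__ == "__main__":
    for label, p in zip(["base", "second", "third"], mlogit_probs(HYP_INDEXES)):
        print(f"{label}: {p:.3f}")
```

Nothing in this calculation depends on the order in which the categories are listed, which is exactly what you want when the categories have no natural order.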