Regression Analysis

Steven Levitt, one of the most famous economists alive and co-author of Freakonomics, once said, “Regression analysis is more art than science.” After creating my own complicated regressions for the first time, I am starting to agree with him!

A regression is a statistical technique for estimating the effect that a number of independent variables have on one dependent variable. It is one of the key tools in an economist’s toolbox for answering questions about how variables affect each other. For example, in my first Econometrics class last year, our final project was to create our own regression on whatever topic interested us. I decided to look at Major League Baseball and see what affected attendance. I included variables like stadium size, how good the team was, how expensive the average ticket was, and a number of other items that could affect demand for tickets. These are the types of questions that regression can answer; it works best when you are trying to see how a number of different measures affect one variable.

After you run a regression, statistical software gives an output of coefficients, each of which essentially tells you how much a one-unit change in an independent variable changes the dependent variable, holding the other variables constant. For example, in my attendance regression I included winning percentage, which turned out to have a coefficient of 26,063. What does that number mean? I put in winning percentage as a decimal, so if a team won 60% of its games, the winning percentage was input as .600. If a team won 10 percentage points more of its games, about sixteen more wins over a 162-game season, and its winning percentage improved by .100 to .700, then its attendance could be expected to go up by .100 x 26,063, or about 2,600 fans per game.
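To make the winning-percentage arithmetic concrete, here is a small Python sketch. The coefficient and the 162-game season come from the example above; the function and variable names are my own, made up for illustration, and this is not the software I actually ran the regression in.

```python
# Coefficient on winning percentage from the attendance regression:
# fans per game for a full 1.000 change in winning percentage.
WIN_PCT_COEF = 26_063
GAMES_PER_SEASON = 162


def attendance_change(delta_win_pct, coef=WIN_PCT_COEF):
    """Predicted change in per-game attendance for a change in winning percentage."""
    return coef * delta_win_pct


# A team improving from .600 to .700.
delta = 0.700 - 0.600
change = attendance_change(delta)

# The same improvement expressed in wins, and the coefficient
# rescaled to "fans per game per extra win".
extra_wins = delta * GAMES_PER_SEASON
per_win = WIN_PCT_COEF / GAMES_PER_SEASON

print(f"Extra wins over the season: {extra_wins:.1f}")
print(f"Predicted change in attendance: {change:.0f} fans per game")
print(f"Equivalent effect per extra win: {per_win:.0f} fans per game")
```

Rescaling the coefficient to a per-win basis does not change what the regression found. It only changes the unit the effect is quoted in, which can make the size easier to judge.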

It is important to note a few caveats with regression analysis, as it is not a perfect science. If you forget to include an important variable, the coefficients on the variables you did include can be biased, sometimes badly enough to be essentially meaningless. Also, as in my example above, it’s important to know the scale of the variable. 26,063 seems like a huge effect, but since I input winning percentage as a decimal rather than straight wins, its effect is actually much smaller than it seems. A final note is that correlation does not imply causation. Just because a variable is an independent variable in a regression does not mean it is a causal variable. It is possible that attendance is not affected by the team’s winning percentage at all, and that instead high attendance makes teams win! While that is highly improbable, it is important when running regressions to carefully avoid the many possible missteps that could mask the true effects you are trying to find. This is what I have been thinking about as I try to interpret my initial regression looking at physician-hospital integration’s effect on quality. What do these coefficients mean, and did I forget anything?
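The forgotten-variable caveat is easier to see with a toy simulation. The sketch below uses entirely made-up data, not my baseball or hospital data: y truly depends on x with a coefficient of 1.0 and on z with a coefficient of 2.0, and z is correlated with x. All of the names and numbers are assumptions chosen for illustration. Leaving z out should push the estimated coefficient on x toward roughly 2.6 (that is, 1.0 + 2.0 x 0.8), while accounting for z should bring it back close to 1.0.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 5000

# Simulated (made-up) data: z is the variable we might forget to include,
# and it is correlated with x.
x = [random.gauss(0, 1) for _ in range(n)]
z = [0.8 * xi + random.gauss(0, 1) for xi in x]
y = [1.0 * xi + 2.0 * zi + random.gauss(0, 1) for xi, zi in zip(x, z)]


def mean(v):
    return sum(v) / len(v)


def slope(u, v):
    """OLS slope from regressing v on u (with an intercept)."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var = sum((a - mu) ** 2 for a in u)
    return cov / var


# Regression with z omitted: y on x only.
b_short = slope(x, y)

# Coefficient on x when z is included, computed by partialling out z:
# regress x on z, keep the residual, then regress y on that residual.
g = slope(z, x)
mx, mz = mean(x), mean(z)
x_resid = [xi - mx - g * (zi - mz) for xi, zi in zip(x, z)]
b_long = slope(x_resid, y)

print(f"Coefficient on x with z omitted:  {b_short:.2f}")
print(f"Coefficient on x with z included: {b_long:.2f}")
```

The partialling-out step is just a way to get the two-variable regression coefficient without a statistics library; in practice regression software does this for you when you add z to the model.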