On hammers and nails

I just saw this article by Mike Konczal about the politically influential 2010 result that economies with debt-to-GDP ratios over 90% perform very poorly. The article discusses one dubious methodological choice and two outright methodological errors, and describes how a re-analysis of the same data with more solid methodology finds that the debt-to-GDP ratio doesn't really seem to affect economic performance. I'm not going to talk about the substance of the papers; as I said in the first post, I don't want this blog to be political, and I'm not an economist anyway. I do, however, want to talk about a potential source of their errors.

The authors of the original paper performed their entire analysis in a spreadsheet program, specifically Excel. There are reasons to avoid Excel in particular for serious analysis, but I think their two errors resulted from characteristics of spreadsheet programs in general. The first, and clearest, error was their accidental exclusion of countries with high debt but average economic performance from the economic growth average. This kind of mistake is very easy to make in a spreadsheet, because spreadsheets do not clearly differentiate values from formulas: looking at a cell, you cannot tell whether the number in it was typed in directly or computed from other cells, and, if the latter, how the computation was performed. To check the formula, you have to click in the cell. Moreover, spreadsheet programs typically contain logic that tries to guess how formulas should change when the spreadsheet changes, when cells should be updated, and what to do when some value or computation is missing or invalid. All of this behind-the-scenes sleight-of-hand makes spreadsheet programs more "user-friendly" in the sense that users have to do less to get a number, but at the cost of obscuring the relationship between the data and the statistical techniques applied to them.

It also looks like the second error resulted from the authors' reliance on spreadsheet-style analysis. Specifically, they compute the average economic performance by first computing an average performance per country, and then averaging the countries' averages. This is inappropriate because the data are unbalanced: there are many data points for some countries, but very few for others. Thus, we should be confident about the within-country average for countries with many data points, but less confident about the within-country average for countries with few data points. One of the commenters on Konczal's article put it well: their method is like taking one batter with a 0.200 average over 100 at-bats and another batter with a single successful at-bat, and concluding that, together, they bat 0.600 on average.
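To see why that weighting misleads, here is a quick back-of-the-envelope check in Python, using the numbers from the batting analogy (20 hits in 100 at-bats for one batter, 1 hit in 1 at-bat for the other):

# Toy numbers from the batting analogy above.
hits = [20, 1]      # batter A: 20 hits in 100 at-bats; batter B: 1 hit in 1 at-bat
at_bats = [100, 1]

# Averaging the per-batter averages treats both batters as equally informative.
average_of_averages = sum(h / n for h, n in zip(hits, at_bats)) / len(hits)

# Pooling all at-bats weights each observation equally.
pooled_average = sum(hits) / sum(at_bats)

print(average_of_averages)  # 0.6
print(pooled_average)       # ~0.208

The average of averages lets the single at-bat count as much as the other batter's hundred; the pooled figure is much closer to what the combined record actually shows.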

Why would they weight countries in the average this way? Well, I suspect it's because averages of averages are easy to compute in a spreadsheet: select the cells containing the averages, and click the "Average" option. As a non-economist with statistical training, I would instinctively analyze these data with a mixed-effects regression of economic performance against debt-to-GDP ratio, with random effects (ideally both random intercepts and random slopes) for country and perhaps for region (e.g. Europe, Asia, and so on), and with year as a fixed-effect control. Such a model would estimate how the debt-to-GDP ratio predicts economic growth, while accounting for the fact that the relationship may vary by country and over time. However, spreadsheets typically do not offer such models.
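For concreteness, here is a minimal sketch of that kind of model in Python using the statsmodels library. The file name debt_growth.csv and the column names growth, debt_ratio, country, and year are made up for illustration, and I've left out the region effect to keep things simple; in R, the lme4 package's lmer would be a natural alternative.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per country-year observation.
data = pd.read_csv("debt_growth.csv")

# Fixed effects for debt-to-GDP ratio and year; random intercepts and
# random slopes for debt_ratio, grouped by country.
model = smf.mixedlm(
    "growth ~ debt_ratio + year",
    data,
    groups=data["country"],
    re_formula="~debt_ratio",
)
result = model.fit()
print(result.summary())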

So: should we just avoid spreadsheets? I don't think so. Spreadsheets can be good for eyeballing data and doing preliminary analyses; I often use gnumeric for just that. The problem comes from relying entirely on spreadsheets, or indeed on any single computational tool (although the "user-friendliness" of spreadsheets makes them especially easy to abuse). All computational tools have drawbacks, and we should be familiar with a variety of them. On the Natural Language Toolkit (NLTK) mailing list, for example, I've seen e-mails asking how to build frequency lists from English text files. If you're building a larger application with NLTK, it might make sense to do this in NLTK (because it will already be loaded into the program), but if all you want is a frequency dictionary, and your text is tokenized by spaces, then standard Unix command-line tools are all you need. For example, given Penn Treebank-style strings in a file strings.txt:

this is an example string .
here is another string .

we can get one word per line with the command:

$ cut -d " " --output-delimiter "
" -f 1- strings.txt
this
is
an
example
string
.
here
is
another
string
.

(the double quotes allow the space and the newline to be passed to cut as literal characters, rather than being treated by the shell as a word separator and the end of the command, respectively). To get a frequency list, pipe the output into sort and uniq -c:

$ cut -d " " --output-delimiter "
" -f 1- strings.txt | sort | uniq -c
      2 .
      1 an
      1 another
      1 example
      1 here
      2 is
      2 string
      1 this

We can sort by decreasing frequency with another pipe:

$ cut -d " " --output-delimiter "
" -f 1- strings.txt | sort | uniq -c | sort -nr
      2 string
      2 is
      2 .
      1 this
      1 here
      1 example
      1 another
      1 an

If there are potentially many files scattered across several directories, there's no need to cook up a directory-traversal function yourself; use find:

$ find . -name "*.txt" -exec cut -d " " --output-delimiter "
" -f 1- {} \; | sort | uniq -c | sort -nr
      2 string
      2 is
      2 .
      1 this
      1 here
      1 example
      1 another
      1 an

Text files that are not Penn Treebank-formatted may need a bit more piping and processing, but a frequency dictionary in and of itself does not demand a full-blown NLP library. Being proficient with several tools lets us build this frequency dictionary more quickly. More importantly, even if we do end up building the frequency dictionary inside the Python code, we have a variety of techniques for double-checking our results. Because those checks are relatively fast and easy, we are more likely to actually do them.
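For comparison, the in-Python version might look something like this (a sketch that assumes whitespace-tokenized text, like the strings.txt above); it should print the same counts as the shell pipeline, and so can serve as a cross-check:

from collections import Counter

# Count whitespace-separated tokens across the file.
freqs = Counter()
with open("strings.txt") as handle:
    for line in handle:
        freqs.update(line.split())

# Print in decreasing order of frequency, like `sort -nr` above.
for word, count in freqs.most_common():
    print(count, word)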

Finally, command-line tools in particular make it easier to keep a record of what you did. In R, for example, you can use txtStart(file="logfile.log") and txtStop() from the TeachingDemos package to keep a text record of your entire session. Or, more generally, if you do your work in GNU screen, you can start a session with screen -L, which logs everything that happens inside it to a file screenlog.#, where # is the number of the logged window, in the session's working directory.

Ok, I think I've gone on long enough.
