Thomas de Graaff - Do regional economists answer the right questions?

Abstract

This position paper revolves around two main propositions; namely, (i) regional (or spatial) economists are very restrictive in the tool set they apply, and consequently (ii) their models do not always match with the type of questions policy makers are concerned about. To start with the latter, policy makers—whether national, regional or local—are oftentimes concerned about holistic approaches and future predictions. Exemplary questions are Which policy instrument works best for my city, What happens after the construction of this highway with housing prices and employment throughout the whole region and Given limited budget, which region should I first invest in. Regional economists—actually, most economists—usually isolate phenomena in order to, at best, explain the causal impact of a single determinant. Indeed, most regional economists feel very uncomfortable when asked to predict or give the best set of determinants for a certain phenomenon. This has its consequences for the tool set regional economists apply. Usually a parametric regression type of framework is applied isolating the determinant under consideration and controlling as much as possible for observables and unobservables, ideally in a pseudo-experimental framework. A direct consequence of this approach is that emphasis is very much on explaining the impact of an isolated determinant and not on predicting (non-marginal) changes in larger systems. For many research questions that is the right approach. For other research questions (revolving around non-marginal changes and predicting) we need other more data driven tools. Therefore, my main argument would be that we allow for more data science techniques in the toolkit of the regional economist, both in research and in education.

1 Introduction: two different cultures

The sexiest job in the next 10 years will be statisticians. (Varian 2014)

The quote above from Hal Varian is in one aspect wrong; nowadays, we do not call them statisticians but data scientists instead. Nevertheless, in the last two decades companies such as Google, Ebay, Whatsapp, Facebook, Booking.com and Airbnb, have not only witnessed enormous growth but to a considerable extent also changed the socio-economic landscape. Indeed, with the increasing abundance of (spatial) data and computer capacity, the ability to gather, process, and visualize data has become highly important and therefore highly in demand as well. And all the models and tools these data scientists within these companies use are very much data driven with often remarkable results.

In his controversial and path-breaking article, Breiman (2001) presented two different cultures in statistical science. One governed by a (probability) theory-driven modeling approach and one governed by a more (algorithmic) data-driven approach. These two cultures carry over to the econometric and ultimately the empirical regional economics domain as well, where—commonly for all social sciences—the theory driven approach still very much dominates the landscape of the realm of contemporary regional economics.

I use a wide definition for the regional economics domain, which consists of most aspects of regional science in general but for which the theoretical approach is always from an economic perspective. Topics such as, e.g, interregional migration, trade, transport flows and commuting on the one side and regional performance, regional clustering, population growth and specialisation on the other side fall all under this, admittedly, rather wide umbrella.

Figure 2 is an adaptation from the one displayed in Breiman (2001) and describes the processes governing these two cultures. Figure 2 (a) is what I refer to as the modeling approach, where a statistical model is postulated and is central to this culture. This is the classical approach where statistical probability theory meets the empiricism of Karl Popper. Usually the model assumed is stated as a linear model and in its most simple form can be denoted as:

Sometimes as well referred to as the frequentists’ approach. However, this typically concerns the debate between classical statistics and Bayesian statistics, where the two approaches I refer to are more concerned with wider frameworks, of which the Bayesian approach is just one of the elements. I come back to Bayesian statistics and the frequentists’ approach later, but I do not see them necessarily as opposites. And I quite object to the term frequentists’ approach as Bayesian statistics is much more focuses on counting then the frequentists’ approach.To be honest, Popper himself was not a great fan of simply null hypothesis testing. He actually argued for the falsification of explanatory models, where in his view falsification does not only rely on statistics but on consensus amongst scientists as well.

\[ \mathbf{y} = \mathbf{x}\beta + \epsilon, \tag{1}\] where in (regional) economics language, $\mathbf{x}$ is referred to as the independent variable, $\mathbf{y}$ as the dependent variable and $\epsilon$ as a residual term. In this setup, using the data at hand, one constructs a statistical test to which extent the estimated coefficient (denoted with $\hat{\beta}$) deviates from a hypothesized value of the coefficient (denoted with $\beta_0$)—typically the hypothesis $H_0: \hat{\beta} = 0$ is used with as alternative hypothesis that $H_1: \hat{\beta} \neq 0$. However, that is always within the context of the model. So, when the null-hypothesis is rejected, it not necessarily means that the true $\beta$ is unequal to zero, it might also be caused by errors in measuring $\bf{x}$ or even using the wrong ![One of the assumptions for regression techniques such as the one used here is actually no misspecification of the model, but—apart from some possible tests on the functional form a specific regression form—usually little attention is give on the validity of the model used. More importantly, within this framework the model itself is usually not tested .]

There is another fallacy with this approach that is often overlooked and that is that the alternative hypothesis being true is a probability as well. Namely, most hypotheses researchers test are typically not very probable. Not taken this into account would actually lead to more null hypotheses to be rejected then should be (false positives).

Figure 2 (b) yields a schematic overview of a more data driven approach. Here, we see an unknown model fed by predictors $\mathbf{x}$ that lead to one or multiple reponses $\mathbf{y}$. The main objective here is not to test hypotheses, but to find the best model instead which able to explain the data and to predict the data. Usually, the models are evaluated by some kind of criterion (e.g., the mean squared error), which is not completely unlike the modeling approach. However, there are two main differences between the two approaches. First, the data driven approach considers several models in a structural approach. For instance, the question which variables to include is captured by an exhaustive sourse of all combinations in the modeling approach (e.g., with classification and regression trees or random forests), while in the theory driven approach, the choice of variables is based on the theory and a small number of variations in the specification. Second, measurements on model performance are done in the data driven approach and, typically, in the model approach. The latter is not that important for hypothesis testing, but for prediction this matters enormously, because adding parameters might increase the in-sample fit, but actually worsen the out-of-sample fit (a phenomenon called overfitting).

In economics in general, and in regional economics in specific, most of the tools employed are very much theory or model driven instead of data driven. My (conservative) estimate would be that at least 90% of all empirical work in regional economics revolves around postulating a (linear) model and testing whether (a) key determinant(s) is (are) significantly different from a hypothesized value—usually zero. That is, the context of the model assumed.

In a seminal contribution, Breiman (2001) states that deep into the 90s 98% of the statisticians actually employed the theory driven paradigm and only 2% a data driven paradigm. With the advent of the availability of internet connectivity, large (online) data sources, and faster computers the statistical realm changed dramatically. However, this has not permeated yet in the social sciences (see as well Varian 2014).

At best, this approach can be seen in a causal inference framework. If a determinant (such as a policy in the context of regional economics) $x$ changes, does it cause then a change in the output $y$ (most economists typically use some welfare measure). This approach thus provides a rigid and useful approach to regional policy evaluation. If we implement policy $x$, does welfare measure $y$ then improve? Note that this always considers a marginal change as $x$ is usually isolated from other (confounding) factors.

Most of this research actually intends to mimic a difference-in-difference approach and gained enormous momentum with the textbook of Angrist and Pischke (2008).

However, policy makers oftentimes have different questions for which they need solutions. Usually, they revolve around questions starting with What determines performance measure $A$?, Which regions can we best invest in? or, more generally, What works for my region?. These types of questions require a different approach than the previous one. Namely, the former type requires an approach focused on explaining while the latter type requires an approach focused on predicting.

The remaining part of this position paper is structured as follows. Section 2 gives an overview of current modeling practices and describes the `traditional’ inference based approach as well as some data-driven approaches that have been used in the recent past (though by far not as often as the traditional methods). Section 3 sets out both a research and an education agenda as it addresses how to bridge the gap between the daily practices of regional economists and the demands of local policy makers. The final section shortly summarizes the main points raised in this position paper.

2 Regional economists turning the blind eye

Unmistakably, in the recent decade the two major changes to economic empirical research in general are the advent of increasingly larger data sources and the large increase in computer power (Einav and Levin 2014). The methods that most economists employ, however, have not changed. Linear regression or one of its close relatives (such as logistic, poisson or negative binomial regression), preferably in a causal framework, is still the most common tool. This also applies to regional economists, who—although coming from a tradition to use various methods from different disciplines—have increasingly used similar methods as in ``mainstream’’ economics.

This focus on marginal effects and causality is certainly very worthwhile and brought us many important insights. However, it is also typically done within a very narrow framework and, below, I will lay out what we are missing both in research and in our educational curricula, when our focus is on the framework above and as advocated so much as in Angrist and Pischke (2008).

3 Incorporating the data science culture agenda

The previous section discussed contemporary and cutting-edge applied econometric methods of (regional) economists. As argued, these methods have merits. Not least because they are all geared towards identifying causal relationships, and to a far greater extent then in other social sciences.

However, these methods do come at some costs. First of all, the results should be interpreted as marginal effects. A small change in $x$ causes a certain change in $y$. Second, the effect is always ceteris paribus. All possible other factors are controlled for. Third, most of these methods face difficulties with heterogeneous effects. Fourth, and final, the underlying statistical framework is often difficult to interpret—for students, scholars, and the audience at large.

These disadvantages do have serious consequences for what this traditional toolkit can and what it cannot. First of all, it is very good in explaining but very bad in predicting. Second, system-wide changes and impacts are difficult to incorporate. Third, it has difficulties with different heterogeneous subgroups. Fourth, the underlying statistical framework makes it difficult to evaluate models. And, as last, the statistical framework also make it difficult to deal with non-linear models and non-parametric techniques are difficult to yield with this specification.

Below, I first explain how using techniques from the data driven approach side, or the data science side, can help research in the field of regional economics further in three directions: model comparison, heterogeneous effect sizes, and predicting.

Thereafter, I describe what needs to be changed in education, so that future students will better enabled to deal with the abundance of larger datasets, the need for better predictions and a more intuitive understanding of probabilities and testing.

3.1 In research

3.1.1 Dealing with heterogeneity

One of the weaknesses of the theory driven approach—or the more classical research methods—is dealing with heterogeneity. Fixed effects regressions only deal with removing level effects and not varying slope effects, difference-in-difference designs only give average treatment effects and instrumental variables have difficulties with incorporating heterogeneity.

There are some advances made in introducing heterogeneous instruments in quantile regression techniques, but the exact mechanism is still not clear-cut.

The argument made against heterogeneity is that it only affects efficiency (i.e., the standard errors), but in most cases this is not true. In non-linear models, such as discrete choice and duration models heterogeneity affects the consistency (i.a., the parameter of interest) as well. Moreover, interpretation of the parameter of interest might be completely off when not allowing for heterogeneous groups.

Figure 4: A mixture of two distributions of housing prices

Consider Figure 4 where in the left panel a density distribution is given from a distribution of lognormally distributed housing prices. When interested in explaining the effect of a variable $x$ on housing prices this is typically the first descriptive plot an applied researcher creates. The middle panel enlights this plot further by combining both a density plot and a histogram. But what if the sample consists of types of housing markets. One overheated and one with ample housing supply. Then most likely the mechanism on both markets are different and the effect $\beta$ could be very well different for both markets. Indeed, the right panel shows that the density distribution from the left panel is actually a realization of a mixture of two (in this case normal) distributions. The housing market with ample supply of houses is then represented by group 1 and the overheated housing market is represented by group 2.

These latent class approaches are typically not much applied in (regional) economics (see for an exception, e.g., Lankhuizen, De Graaff, and De Groot 2015). However, correct identification of submarkets or subgroups could be very important for policy makers as the average treatment effect may very well not even apply to anyone (an argument, in Dutch, made as well in Graaff 2014).

In other economic or social science fields, such as market and transportation science, however, this is already a common approach.

Slowly, the notation that fixed effects contain much useful information permeated in the regional economics domain. An insightful and straightforward way to do this is by adapting the wage model in Equation 6 as follows:

\[ \begin{align} \ln(w_{ic}) &= \xi_c + \beta \ln(d_{ic})+\mathbf{z_{ic}}\gamma + \epsilon_{ic} \notag \\ \xi_c&=\alpha + \mathbf{x_c}\delta + \mu_c, \end{align} \tag{7}\]

where the individual wage model is now split up in two stages. The first stage model the individual variation and regional fixed effects. The second stage now regresses regional variables on the estimated fixed effects. This approach is now frequently applied (for example, in the so-called sorting model Bayer, McMillan, and Rueben 2004; Bayer and Timmins 2007; Zhiling, Graaff, and Nijkamp 2016; and Bernasco et al. 2017).

Two large advantages of this approach are that the standard errors on both the individual and the regional level are correctly estimated and that, if needed, instrumental variable methods may be applied in the second stage. There is one disadvantage and that is the fixed effects in the second stage are not observed but estimated (imputed) and that has an effect on the standard errors.

Note that model Equation 7 is very much alike multilevel models, which are very often used both in the data driven approach and in other social science apart from economics. Multilevel modeling works great in both correctly estimating a model with observations on various levels (such as individuals, firms, sectors and regions) and in retrieving heterogeneous estimates (both in levels and in slopes). And with the increasing advent of micro-data, combining a individual-regional model as in Equation 7 with the more rigorous structure of multilevel modeling is definitely worth more attention in the near future.

Interestingly, more (spatial) non-parametric approaches (see, e.g., the geograpically weighted regression exercise in Thissen, Graaff, and Oort 2016) have become more popular as well in the last decade (typically, because of increased computer power). This approach needs more attention as well, as the connection with (economic) theory is often lost. And, especially regional geographers apply spatial non-parametric techniques, not the regional economists.

3.1.2 Model comparison

One element that is notoriously weak in the theory-driven approach is model comparison—or the models should be nested. And in many cases, model comparison is often very much asked by policy makers and the audience at large, if not only for finding the correct specification. The latter question is concerned with the question which variables (predictors) to include in a model and which predictors of them perform best. Note that this is analogous to questions regional policy makers might have: such as, which policy instruments best to deploy given limited financial budgets.

A typical example can be found in the field of spatial econometrics model where comparison is an important issue as typically there are several competing theories, non-nested, for the distance decay function (usually measured with a so-called spatial weight matrix $\mathbf{W}$). And usually those theories are very much related (e.g., distance decay measured in Eucledian distance or generalized travel costs).

Another field where model comparison is of large importance is in the estimation of the strength of socio-ecoomic networks. In theory, socio-economic networks should produce so-called power-laws: or a loglinear relation between the size of nodes and the number of connections. Empirically, these relations often follow a slightly different distribution. What kind of distribution fits then best is still a matter of debate.

One of the most famous of these relations is Zipf’s law, where the ordering of cities and the size of the population follows an almost perfect loglinear distribution; see for an in-depth treatment Gabaix (1999).

For proper model comparison, a Bayesian approach is almost unavoidable. The key difference between the frequentist and the Bayesian approach is how to interpret uncertainty. In the frequentist approach uncertainty originates from sampling, while in the Bayesian approach uncertainty is caused by not having enough information. So, a Bayesian statistician lives in a deterministic world but has a limited observational view.. Note that the rule of Bayes is not unique for Bayesian statistics. Namely, this rule is central for all probability theory.

Both frequentists and bayesians rely heavily on sampling. However, sampling in frequentist statistics is a device to construct undercertainty around an estimate. Sampling in Bayesian statistics is a way to perform integral calculus (or to simulate observations).

What is unique for each Bayesian model is that it has a prior and posterior. The prior is an assumption about something that you do not know (uncertainty measured by a parameter). With additional information (data), knowledge about the uncertainty is then updated (and hopefully the uncertainty is diminished). The updated probabilities are represented in a posterior distribution. To understand the probabilities then is simply a matter of sampling from the posterior distribution. So, the frequentist approach typically give a point estimate of a parameter, the Bayesian approach gives the whole distribution of the parameter. Note that under the Bayesian paradigm, everything (including the data) is regarded as an variable with associated uncertainty.

\[ \begin{align} \ln(h_r) & \sim \text{Normal}(\mu_r, \sigma) \tag{likelihood}\\ \mu_r & = \alpha + \beta x_r \tag{linear model}\\ \alpha & \sim \text{Normal}(12,3) \tag{$\alpha$ prior}\\ \beta & \sim \text{Normal}(5,10) \tag{$\beta$ prior}\\ \sigma &\sim \text{Uniform}(0,2) \tag{$\sigma$ prior} \end{align} \tag{8}\]

Equation 8 gives an example of a Bayesian linear regression model.. Here, we want to model the relation between regional housing prices ($h_r$) and the regional percentage of open space ($o_c$). Note that all parameters and the distribution of the data (likelihood) require distributional assumptions. This is a disadvantage in relation to the inference based frequentist approach, where no distributional assumptions are needed. But, note as well that Equation 8 specifies explicitly all assumptions for this model (e.g., a linear model and a normal distribution for the likelihood). If you think the model is incorrect you can rather easily change the assumptions.

Under relatively mild assumptions this should yield similar results as the frequentist approach.

Estimating Bayesian models have always been computationally cumbersome, especially with more parameters as sampling from the posterior distribution equalizes sampling from a multi-dimensional integral. Fortunately, computational power has increased dramatically in the last decades and techniques for sampling from the posterior distribution have become rather efficient (the most often used techniques nowadays are Monte Carlo Markov Chain algorithms which is basically a simulation of the posterior distribution).

Although Bayesian statistics has already been applied to spatial econometrics (see the excellent textbook of LeSage and Pace 2009), applications have not permeated much to other fields in regional economics, such as in regional growth estimations, individual-regional multi-level modeling and population-employment modeling.

3.1.3 Predicting

A last field not well developed in (regional) economics is that of predicting. Most economists would shy away from predictions as, in their opinion, identifying causal relations is already difficult enough (they have a point there). What economists love to do is giving counterfactuals instead. For example, if regional open space would decrease significantly, what would happens with regional housing prices. This counterfactual approach looks very much as a prediction, however there are two large disadvantages associated with counterfactuals.

In popular media, though, they are less hesitant to offer predictions.

First, counterfactual are always made in sample. Actually, all marginal effects are made in-sample. Splitting the sample in a training set and a test set is not something that (regional) economists are prone to do. There is an intrinsic worry then for , especially when using many fixed effects. Explanatory power may be very high, but could also be very situation related. What works in one region, does not necessarily works in another region. Note that predicting in spatial settings is more difficult as the unit of analysis is typically a spatial system. And sub-setting a spatial system in a training and test set is often difficult.

Consider, e.g., the following often used gravity model in linear form as depicted in Equation 9:

\[ \ln(c_{ij}) = \alpha + \beta \ln(P_i) + \gamma \ln(E_j) + \ln(d_{ij}) + \epsilon_{ij}. \tag{9}\]

Here, we aim to model the number of commuters ($c$) from region $i$ to region $j$, by looking at the total labor force $P_i$ in region $i$, the total number of jobs $E_j$ in region $j$ and the distance ($d_{ij}$) between the two regions. Suppose, we can improve the distance between region $i$ and $j$ by, e.g., enlarging highway capacity. This does not only change the commuter flow between $i$ and $j$, but also between other region; say between $i$ and $k$. As usual there is no free lunch and total employment and population in each should remain constant, at least in the short-run.

However, this make sub-setting difficult and correctly predicting cumbersome. But, Equation 9 of above is just one example of a large class of models that all face this difficulty. And policy makers (and firms) are actually very much interested in the questions associated with these predictions. Questions related to the impact of Brexit on other countries, total network effects of infrastructure improvements, identifying profitable routes for airlines, impact of housing projects on commuting quickly come to mind. So, it is especially the relation between predicting and spatial (interaction) systems that need considerable attention.

A second problem with the counterfactual approach is that it considers marginal changes. Unfortunately, in models as Equation 9 this would not work. A marginal change on the link between $i$ and $j$ would have marginal changes on most other links. Marginal changes in a network setting is still a relatively underdeveloped area.

That is, in applied empirical statistical work. Computational equilibrium models and to a lesser extent input-output models are able to model network-wide impacts. However, these models are cumbersome to construct and are less based on data.

So, on of the main research challenges in the regional economic domain for the near future would be to combine the data science models with the concept of spatial interaction models in such a way that both predictions can be made and that model restrictions are still satisfied.

3.2 In education

As discussed above, in regional economics—in fact, in all social sciences—introductory statistics courses are like cook books. In this situation you need that recipe (test), in that situation you need that recipe (test). These recipes even perfectly coincide with the drop-down menu from certain, grapically user interface driven, statistical software packages such as SPSS. This causes students not to understand the underlying mechanism but just to apply procedures (or actually push buttons).

Without going into the need for using a frequentist or a Bayesian approach (they coincide more than most people think), I would actually argue very much for already using computers and coding in an early phase in students’ education. This could coincide with more traditional probability theory, but has an advantage that it is a general approach instead of a flowchart.

Figure 5: Overlapping normal distributions

As an example, consider Figure 5 where two normal distributions are depicted. The left one has a mean of $-1$, the right one has a mean of $1$. Both have a standard deviation of 1. The question is now to what extent these distributions are different from each other. (Slightly rephrased: this is actually the problem whether two coefficients in a frequentist framework are different from each other). One approach is to search for a suitable test (and quickly run in a plethora of $F$, $z$ and $t$-tests); so, following a flowchart again.

Another approach would actually be to count—well, take the integral of —the number of observations in the area that belongs to both distributions. By hand this is infeasible, but with a computer this is rather easy. Just sample a reasonable amount of realisations from both distributions (say $N$ times) and count how many times the realisation from the second distribution is smaller than the first distributions. To get a probability, divide by $N$. In R the code simply boils down to:

  N = 100000          # number of draws
  # Draw from first distribution
  n1 <- rnorm(N, (-1), 1) 
  # Draw from second distribution
  n2 <- rnorm(N, 1, 1) 
  # Count how many times second dist.
  # is smaller than first dist.
  count <- sum(n2 < n1)    
  count/N       # get probability

And for those who are interested, the probability is approximately $0.079$. Note that this is a full-blown probability and not so much a test. You could easily turn this into a test when comparing this with a pre-defined probability. If you find this probability to high, then you actually have large doubts whether these two coefficients are different.

Note that I refrain from using the term significance level. This concept is truly a frequentists’ concept and on a fundamental level relies on the concept of infinite sampling to get uncertainty.

Although this approach is definitely intuitive and arguable very powerful (if you understand the approach above, you basically understand Bayesian statistics as well) it does require computer literate skills from students. And contrary to popular belief, most students actually face large difficulties with coding, command line tools, working with file systems, and so on. This is caused by the fact that all tools they usually work with are driven by drop-down menu’s, templates and strong graphical user interfaces.

Typically they work with Powerpoint, Excel and Word.

This is also caused by the fact that in regional economics (actually in al the social sciences), remarkably little attention has been given to the set of computer-related tools students could use, why they should use them and the relation between them (with some exceptions as, amongst some others, by Rey 2014; Arribas-Bel and Graaff 2015; Arribas-Bel, Graaff, and Rey \noop{3002}Forthcoming). This is even more remarkable as reproducibility and robustness of results become more important in research and teaching.

And this does not not apply for statistical software tools such as R or Python, but as well to other fields. Gathering data, manipulating data, visualising data and communing results are all skills that arguably are very important for students and scientist and become even more important in the future (Varian 2014). There are some exceptions as Schwabish (2014), but in (regional) economics these skill still receive not much attention—in research, but especially in education.

I conclude this section by arguing that we miss three main elements in our curriculum. The first on being a larger emphasis on computer literature skills, such as coding, command line tools, visualization of data, an so forth. The second is more room for the data driven approach, where using software packages such as R of Python, problems are solved with data science techniques, with its larger emphasis on predicting and non-linear modeling. To be clear, simulation exercises as above functions as well as a data driven approach. Most importantly, students should understand the underlying mechanism, instead of applying procedures. The third and final element that is missing is consistently throughout the curriculum. This is often understood as applying the same tools for each course, but this is not necessarily the case. What I mean with consistently is that elements from method courses should come back in regular courses. Nowadays, most courses could implement an empirical element, such as regression techniques, data visualization, data manipulation, and perhaps coding as well. Why otherwise give a Python in the first year of the bachelor, without using that in other courses?

4 Into the abyss

I started this paper with the observation, that, in the words of Breiman (2001), there seems so be two cultures in statistical or econometric modeling; a theory driven and a data driven approach. These two approaches are not mutually exclusive, but complementary. And both have their own strengths and weaknesses. However, especially in economics—and thus in regional economics as well—the theory driven approach still seems to be highly dominant, even with the advent of increasingly larger (micro-)databases. Arguably, this is problematic as the theory driven approach has difficulties when answering questions typically asked by policy makers; questions such as What works best for my region?, What happens with the capacity of my whole network when I invest in a specific highway link? and In which region should I invest to get highest returns?.

So, the main argument of this paper lies in introducing more data approach/data science techniques in the toolkit of the regional economist. Other related fields, even in the social sciences, have already made large advances, such as predictive policing in criminology, latent class approaches in transportation modeling, and the use of deep learning techniques in marketing sciences.

Obviously, this needs large investments (mostly in time), both for researchers and for teachers. The first group needs to invest in new techniques and probably in new statistical software. The second group needs to change parts of the curriculum in terms of the specific contents of methods courses and exercises. Fortunately, many online and open source manuals, videos and even textbooks are available. Moreover, companies such as DataCamp allow for free subscriptions as long as the material is used for classes.}

Especially for the R programming environment R Core Team (2017) there is a vast amount of material available on the internet, such as R for Data Science and Efficient R programming.

To conclude, I would like to note that apart from the intrinsic scientific arguments there are two other very compelling arguments to invest at least some time in data driven approaches. First, it coincides wonderfully with other techniques, such as versioning, blogging (publishing to HTML), and command line tools. All these approaches ensure that research becomes more reproducible. Something that becomes more and more a hard requirement by both university and the audience at large. Second, when looking at recent advances both in industry (e.g., all the dotcom companies but also others, such as more traditional media companies) and in other scientific disciplines, it is not the question if regional economists should invest more in the data science approach, but the question how soon can we start.

References

Angrist, Joshua D, and Jörn-Steffen Pischke. 2008. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton university press.

Arribas-Bel, Daniel, and Thomas de Graaff. 2015. “WooW-II: Workshop on Open Workflows.” REGION 2 (2): 1–2.

Arribas-Bel, Daniel, Thomas de Graaff, and Sergio Rey. \noop{3002}Forthcoming. “Looking at John Snow’s Cholera Map from the XXIst Century: A Practical Primer on Reproducibility and Open Science.” In Regional Research Frontiers: The Next 50 Years, edited by Randall Jackson and Peter Schaeffer. Berlin: Springer.

Bayer, Patrick, Robert McMillan, and Kim Rueben. 2004. “An Equilibrium Model of Sorting in an Urban Housing Market.”

Bayer, Patrick, and Christopher Timmins. 2007. “Estimating Equilibrium Models of Sorting Across Locations.” The Economic Journal 117 (518): 353–74.

Bernasco, Wim, Thomas de Graaff, Jan Rouwendal, and Wouter Steenbeek. 2017. “Social Interactions and Crime Revisited: An Investigation Using Individual Offender Data in Dutch Neighborhoods.” Review of Economics and Statistics 99 (4): 622–36.

Bettencourt, Luis, and Geoffrey West. 2010. “A Unified Theory of Urban Living.” Nature 467 (7318): 912–13.

Breiman, Leo. 2001. “Statistical Modeling: The Two Cultures.” Statistical Science 16 (3): 199–231.

Deaton, Angus. 2010. “Instruments, Randomization, and Learning about Development.” Journal of Economic Literature 48: 424–55.

Einav, Liran, and Jonathan Levin. 2014. “Economics in the Age of Big Data.” Science 346 (6210): 1243089.

Gabaix, Xavier. 1999. “Zipf’s Law for Cities: An Explanation.” The Quarterly Journal of Economics 114 (3): 739–67.

Graaff, Thomas de. 2014. “Stedelijke Voorzieningen En Bevolking: Wie Woont Waar En Waarom?” in: Jessie Bakens, Henri de Groot, Peter Mulder and Cees-Jan Pen (eds.), Soort zoekt soort: Clustering en sociaal-economische scheidslijnen in Nederland, platform31.

Graaff, Thomas de, Frank G. van Oort, and Raymond J. G. M. Florax. 2012a. “Regional Population-Employment Dynamics Across Different Sectors of the Economy.” Journal of Regional Science 52 (1): 60–84.

———. 2012b. “Sectoral Heterogeneity, Accessibility and Population-Employment Dynamics in Dutch Cities.” Journal of Transport Geography 25: 115–27.

Imbens, Guido W, and Joshua D Angrist. 1994. “Identification and Estimation of Local Average Treatment Effects.” Econometrica: Journal of the Econometric Society, 467–75.

Lankhuizen, Maureen B. M., Thomas De Graaff, and Henri L. F. De Groot. 2015. “Product Heterogeneity, Intangible Barriers and Distance Decay: The Effect of Multiple Dimensions of Distance on Trade Across Different Product Categories.” Spatial Economic Analysis 10 (2): 137–59.

LeSage, James, and Robert Kelley Pace. 2009. Introduction to Spatial Econometrics. CRC Press.

McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in r and Stan. Vol. 122. CRC Press.

R Core Team. 2017. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Rey, Sergio J. 2014. “Open Regional Science.” Annals of Regional Science 52 (3): 825–37.

Roback, Jennifer. 1982. “Wages, Rents, and the Quality of Life.” Journal of Political Economy 90 (6): 1257–78.

Schwabish, Jonathan A. 2014. “An Economist’s Guide to Visualizing Data.” The Journal of Economic Perspectives 28 (1): 209–33.

Thissen, Mark, Thomas de Graaff, and Frank G. van Oort. 2016. “Competitive Network Positions in Trade and Structural Economic Growth: A Geographically Weighted Regression Analysis for European Regions.” Papers in Regional Science 95 (1): 159–80.

Varian, Hal R. 2014. “Big Data: New Tricks for Econometrics.” The Journal of Economic Perspectives 28 (2): 3–27.

Zhiling, Wang, Thomas de Graaff, and Peter Nijkamp. 2016. “Cultural Diversity and Cultural Distance as Choice Determinants of Migration Destination.” Spatial Economic Analysis 11 (2): 176–200. https://doi.org/10.1080/17421772.2016.1102956.

Citation

For attribution, please cite this work as:

Graaff, Thomas de. 2021. “Do Regional Economists Answer the Right Questions?” June 5, 2021. http://thomasdegraaff.nl//posts/post/research_agenda.

Do regional economists answer the right questions?

1 Introduction: two different cultures

2 Regional economists turning the blind eye

2.1 The blind eye in research

2.1.1 Exogeneity versus independence

2.1.2 Local and average treatment effects

2.1.3 Fixed effects and heterogeneity

2.2 The blind eye in education

2.2.1 Statistics & Applied Econometrics

2.2.2 Data science