Introduction

sortingmod is an R package for estimating the sorting model - a discrete choice model that explains the location decisions of heterogeneous individuals over a set of alternative locations. The model was developed by Bayer et al. (2004), following the work of Berry et al. (1995). It relies on two assumptions: individuals choose the location that maximizes their utility, and individuals with different characteristics have different preferences and therefore value location characteristics differently. The results of the model provide choice probabilities for each alternative, insight into the valuation patterns of heterogeneous agents, and marginal willingness-to-pay values for location characteristics.

The package is constructed as an accompanying tool for the Summer School in “Hedonic price analysis and the residential location choice”, hosted by the Kraks Fond - Institute for Urban Economic Research (Copenhagen, Denmark).

Estimation of the residential sorting model

The sortingmod package follows the methodology described in Bayer et al. (2004) to analyze how heterogeneous individuals sort themselves into alternative locations based on their preferences for local characteristics.

sortingmod solves the following model:

\[U_{j}^{i} = \sum_{k=1}^{K}\beta_{k,0}X_{k,j} + \sum_{k=1}^{K}\sum_{l=1}^{L}\beta_{k,l}(Z^{i}_{l}-\bar{Z_{l}})X_{k,j} + \xi_j + \epsilon_{j}^{i}\]

Where: \(U_{j}^{i}\) is the utility of individual \(i\) from alternative \(j\), \(X_{k,j}\) is the value of the \(k\)-th alternative characteristic in location \(j\), \(Z^{i}_{l}\) is the value of the \(l\)-th characteristic of individual \(i\), \(\bar{Z_{l}}\) is the sample mean of individual characteristic \(l\), \(\beta_{k,0}\) are the coefficients of the alternative characteristics for the average individual, \(\beta_{k,l}\) are the cross-effect coefficients, \(\xi_j\) is an alternative-specific error term, and \(\epsilon_{j}^{i}\) is the individual error term.
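To make the notation concrete, the following minimal sketch evaluates the deterministic part of the utility (everything except \(\epsilon_{j}^{i}\)) for a toy example with three individuals, two locations, \(K = 2\) and \(L = 1\). All numbers and object names (X, Z, beta0, beta_cross, xi) are made up for illustration and are not part of the package.

```r
# Toy data: J = 2 locations with K = 2 characteristics, N = 3 individuals with L = 1 characteristic
X <- matrix(c(1.0, 0.5,
              2.0, 1.5), nrow = 2, byrow = TRUE,
            dimnames = list(c("loc1", "loc2"), c("x1", "x2")))       # J x K
Z <- matrix(c(10, 12, 14), ncol = 1, dimnames = list(NULL, "z1"))    # N x L

# Illustrative parameter values (assumptions, not estimates)
beta0      <- c(x1 = 0.3, x2 = -0.2)                                 # beta_{k,0}
beta_cross <- matrix(c(0.05, 0.10), nrow = 2,
                     dimnames = list(c("x1", "x2"), "z1"))           # beta_{k,l}, K x L
xi         <- c(loc1 = 0, loc2 = 0.1)                                # xi_j

# Demean the individual characteristics: Z - Z_bar
Zc <- sweep(Z, 2, colMeans(Z))

# Part common to all individuals: sum_k beta_{k,0} X_{k,j} + xi_j
mean_part <- as.vector(X %*% beta0) + xi

# Individual-specific part: sum_k sum_l beta_{k,l} (Z_l - Z_bar_l) X_{k,j}
cross_part <- Zc %*% t(beta_cross) %*% t(X)                          # N x J

# Deterministic utility of every individual (rows) for every location (columns)
V <- sweep(cross_part, 2, mean_part, "+")
V
```

The individual with the sample-mean characteristic (the second row) receives only the common part, which is why the cross effects can be read as deviations from the preferences of the average individual.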

Due to the likely correlation between prices and unobserved characteristics, it is necessary to estimate the sorting model in two stages (see the discussion in Berry et al. (1995)).

First stage

The first stage of the sorting model involves estimating the cross-effect coefficients while treating the mean indirect utilities \(\delta_{j} = \sum_{k=1}^{K}\beta_{k,0}X_{k,j} + \xi_j\) as alternative-specific constants. Assuming that the random term \(\epsilon_{j}^{i}\) is IID extreme value type I distributed, the choice probabilities take the multinomial logit form, where \(V_{j}^{i}\) denotes the deterministic part of the utility:

\[P^{i}_{j} = \frac{e^{V_{j}^{i}}}{\sum_{m=1}^{J}e^{V_{m}^{i}}} = \frac{e^{(\delta_{j} + \sum_{k=1}^{K}\sum_{l=1}^{L}\beta_{k,l}(Z^{i}_{l}-\bar{Z_{l}})X_{k,j})}}{\sum_{m=1}^{J}e^{(\delta_{m} + \sum_{k=1}^{K}\sum_{l=1}^{L}\beta_{k,l}(Z^{i}_{l}-\bar{Z_{l}})X_{k,m})}}\]

The cross-effect coefficients and alternative-specific constants are estimated by maximum likelihood, that is, by maximizing the joint probability of the observed location choices, using the BHHH algorithm (Berndt et al., 1974).
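As a rough illustration of what the likelihood looks like, the sketch below computes logit choice probabilities and the log-likelihood for given deterministic utilities. The utility matrix V and the vector chosen are invented toy values. The code only evaluates the likelihood at one point and does not reproduce the optimization performed by the package.

```r
# Deterministic utilities (rows: 3 individuals, columns: 2 locations); illustrative values
V <- matrix(c(0.0, -0.1,
              0.2,  0.4,
              0.4,  0.9), nrow = 3, byrow = TRUE)

# Index of the location actually chosen by each individual (illustrative)
chosen <- c(1, 2, 2)

# Logit probabilities; subtracting the row maximum avoids overflow in exp()
expV <- exp(V - apply(V, 1, max))
P    <- expV / rowSums(expV)

# Log-likelihood: sum over individuals of the log probability of the chosen location
loglik <- sum(log(P[cbind(seq_len(nrow(V)), chosen)]))
loglik
```

A maximum likelihood routine searches for the parameter values (here hidden inside V) that make this quantity as large as possible.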

Second stage

The second stage of the estimation involves estimating the equation:

\[\delta_{j} = \sum_{k=1}^{K}\beta_{k,0}X_{k,j} + \xi_j\]

where the vector of mean indirect utilities (\(\delta_{j}\)) is explained by the alternative characteristics.
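The following sketch shows the idea of this regression on simulated data, ignoring endogeneity for the moment. The data-generating values (beta0, the noise level) and all object names are assumptions made for illustration. In practice \(\delta_{j}\) comes from the first stage rather than from a simulation.

```r
set.seed(1)

J  <- 50                                    # number of alternatives
X  <- cbind(x1 = rnorm(J), x2 = rnorm(J))   # alternative characteristics
beta0 <- c(0.8, -0.5)                       # "true" beta_{k,0}, chosen for the simulation
xi <- rnorm(J, sd = 0.1)                    # unobserved alternative quality

# Mean indirect utilities implied by the second-stage equation
delta <- as.vector(X %*% beta0) + xi

# Ordinary least squares of delta on the alternative characteristics.
# lm() adds an intercept, which absorbs the normalization of the constants.
fit <- lm(delta ~ x1 + x2, data = data.frame(delta, X))
coef(fit)
```

Because xi is generated independently of X here, plain least squares recovers the coefficients well. This is exactly what fails when a characteristic such as price is correlated with \(\xi_j\).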

Endogeneity in the observed alternative characteristics

Since prices are likely to be correlated with the error term \(\xi_j\), the second stage estimation follows a two-stage least squares (2SLS) procedure to account for this endogeneity problem. Here, we follow the approach of Bayer et al. (2004) for instrument construction and construct a price instrument under the assumption that it is uncorrelated with unobserved location heterogeneity. Namely, we compute the price vector that would clear the market under the assumption that \(\xi_j = 0\).
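To illustrate why an instrument is needed, the sketch below simulates an endogenous price and compares ordinary least squares with a hand-rolled 2SLS estimate. The instrument instr is a generic simulated variable, not the market-clearing instrument of the package, and all coefficients are assumptions chosen for the simulation. Note that standard errors from the manual second regression are not valid 2SLS standard errors, so a dedicated IV routine should be used for inference.

```r
set.seed(42)

J     <- 2000
instr <- rnorm(J)                      # hypothetical instrument, independent of xi
xi    <- rnorm(J)                      # unobserved alternative quality
# Price depends on the instrument AND on unobserved quality, so it is endogenous
price <- 1 + 0.8 * instr + 0.6 * xi + rnorm(J, sd = 0.5)
delta <- 2 - 1.5 * price + xi          # true price coefficient: -1.5

# Naive OLS: biased towards zero because price and xi are positively correlated
ols <- lm(delta ~ price)

# 2SLS by hand: (1) project price on the instrument, (2) regress delta on the projection
stage1    <- lm(price ~ instr)
price_hat <- fitted(stage1)
tsls      <- lm(delta ~ price_hat)

c(ols = coef(ols)[["price"]], tsls = coef(tsls)[["price_hat"]])
```

In this simulation the OLS estimate is clearly less negative than the true value, while the 2SLS point estimate is close to it.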

Quick user guide

Installation

The purpose of this section is to give a general sense of the package and its components, and an overview of how to use it. sortingmod is not currently available from CRAN, but you can install the development version from GitHub (note that you need the package devtools to be able to install packages from GitHub):

# install.packages("devtools")
# library("devtools")
devtools::install_github("thdegraaff/sortingmod")

Once installation is completed, the package can be loaded.

library(sortingmod)

Data structure

The data for the estimation should be in data frame format, and it must contain:

  • One or more columns indicating characteristics of a unique agent making a discrete choice between available alternatives (individuals, households, etc.).
  • A vector indicating the chosen alternative.
  • One or more columns indicating characteristics of the chosen alternative.

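Users can work with their own data, or use the example data provided with the package (the data frame municipality in the examples below). The first rows of the example data look as follows: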

##           id age    income hh_kids mun_code  lnprice    nature monuments
## 1 1202400414  46 11.248465       1       34 12.04515 0.1879051  2.639057
## 2 1201400454  56 11.493365       0       34 12.04515 0.1879051  2.639057
## 3 1109400358  22  9.588434       1       34 12.04515 0.1879051  2.639057
## 4 1202400342  35 11.446497       1       34 12.04515 0.1879051  2.639057
## 5 1110400442  49 11.040759       1       34 12.04515 0.1879051  2.639057
## 6 1201400437  49 10.672785       1       34 12.04515 0.1879051  2.639057

Notes:

  • Each row represents a unique individual. The number of alternatives should therefore be smaller than or equal to the number of individuals.
  • Individual and alternative characteristics should be numeric variables. Factor variables are not yet supported, so it’s best to transform them into numeric dummy variables if you intend to include them in the model.
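One way to turn a factor into numeric dummy variables is base R's model.matrix. The sketch below uses a small made-up data frame. The column tenure and the object names are hypothetical and do not appear in the example data.

```r
# Toy individual-level data with one factor column (hypothetical)
people <- data.frame(
  id     = 1:4,
  income = c(11.2, 11.5, 9.6, 11.4),
  tenure = factor(c("owner", "renter", "renter", "owner"))
)

# model.matrix() expands the factor into 0/1 columns; drop the intercept column.
# The first factor level ("owner") becomes the reference category.
dummies <- model.matrix(~ tenure, data = people)[, -1, drop = FALSE]

# Combine the numeric columns with the dummy columns
people_num <- cbind(people[, c("id", "income")], dummies)
people_num
```

The resulting column tenurerenter is numeric and can be listed among the individual characteristics like any other variable.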

First stage estimation

The function first_stage estimates the first stage of the model. first_stage takes four main arguments:

  • code_name - which indicates (with name or column number) the vector with the chosen alternative.
  • Z_names - which indicates (with names or column numbers) the vectors with individual data.
  • X_names - which indicates (with names or column numbers) the vectors with alternative data.
  • data - which indicates the data to be used.

  # Column holding the chosen alternative
  code_name <- "mun_code"
  # Alternative (location) characteristics
  X_names <- c("lnprice","kindergardens_1km","p_mig_west","nature","monuments","cafes_1km")
  # Individual characteristics
  Z_names <- c("income","double_earner_hh","hh_kids","age", "migskill")
  s1.results <- first_stage(code_name, Z_names, X_names, data = municipality, print_detail = 1)

s1.results is a maxLik object, containing the estimation results and all relevant information. Several methods can be called on the object, including summary, coef, and stdEr. Calling summary provides a neat summary of the estimation, including the estimates and standard errors of the cross-effect coefficients and the alternative-specific constants, which will later be used in the second stage of the estimation.

Second stage estimation

The function second_stage estimates the second stage of the model. second_stage takes four arguments:

  • s1.results - which indicates the maxLik object with the estimation results of the first stage of the sorting model.
  • data - which indicates the data to be used.
  • endog - (optional) which indicates the endogenous variable(s) to be instrumented (given as quoted column name(s) from the dataset).
  • instr - (optional) which indicates the instrument(s) for the endogenous variable.

  # Second stage without instruments
  s2.results <- second_stage(s1.results, data = municipality)

  # Second stage with lnprice instrumented; `instrument` stands for an
  # instrument supplied by the user
  endog <- "lnprice"
  s2.results <- second_stage(s1.results, data = municipality, endog = endog, instr = instrument)
  summary(s2.results)

Before estimating the second stage of the model, it is important to consider that some alternative characteristics are likely to be endogenous. For example, local price levels are very likely to be correlated with unobserved alternative features. For this reason the second_stage function allows (and encourages) users to include instruments in this part of the model estimation, using the argument endog to indicate the endogenous variable(s) and the argument instr to indicate the chosen instrument(s).

Calculating the market equilibrium instrument

sortingmod also includes the function sorting_inst, which generates an instrument that arises “naturally” from the model and utilizes its equilibrium properties, following Bayer et al. (2004). The calculated instrument is the price vector that clears the market under the assumption that it is uncorrelated with unobserved heterogeneity between alternatives. It is calculated in an iterative process based on the results of the first stage of the estimation.

  endog <- ("lnprice")
  phat <- sorting_inst(s1.results, endog, data = municipality, stepsize = 0.02)

sorting_inst requires the following arguments:

  • s1.results - which indicates the maxLik object with the estimation results of the first stage of the sorting model.
  • endog - which indicates the endogenous variable to be instrumented.
  • data - which indicates the dataset to be used.

The arguments stepsize and threshold, which control the iteration, are discussed in the notes below.

Unlike first_stage, sorting_inst does not take chosen alternatives, individual characteristics or alternative characteristics as arguments, because they are already stored in the s1.results object. However, note that data must be the same data used for the first stage estimation, and that the indicated endogenous variable must be included in the set of alternative characteristics used in the estimation (and specified in X_names).

sorting_inst returns a list which contains (1) the results of the IV estimation, with the computed vector as the instrument for the endogenous variable, (2) a vector with the computed instrument, and (3) the correlation between the computed instrument and the original variable.

summary(phat$IV_results)
## 
## Call:
## ivreg(formula = formula_iv, data = data_alt, weights = 1/se.weights)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -25.39717  -3.63648   0.07974   4.27760  19.50532 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       322.6482    93.9647   3.434 0.000803 ***
## lnprice           -26.9416     7.7619  -3.471 0.000708 ***
## kindergardens_1km   0.3002     0.3559   0.843 0.400537    
## p_mig_west          0.5586     0.1965   2.843 0.005203 ** 
## nature              7.6055     3.5850   2.121 0.035810 *  
## monuments           0.9410     0.3745   2.513 0.013215 *  
## cafes_1km          -0.4035     0.1633  -2.470 0.014822 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.624 on 128 degrees of freedom
## Multiple R-Squared: -4.699,  Adjusted R-squared: -4.966 
## Wald test: 3.742 on 6 and 128 DF,  p-value: 0.001831

Notes:

  • The iteration process stops if it detects that the values are not converging. Lack of convergence can be due to too much unobserved heterogeneity in the model, but it may also be that the contraction mapping coefficient (stepsize argument) is too large, which causes the iteration to overshoot. So before changing your model, try lower stepsize values. If convergence is not reached even though the values are sufficiently close to zero, the threshold can be adjusted by providing a new value through the threshold argument.
  • sorting_inst accepts only one argument for endogenous variable as input, and subsequently also returns only a single vector as instrument.
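The role of the step size and the threshold can be illustrated with a small damped fixed-point iteration that searches for market-clearing prices in a toy logit model. This is only a sketch of the general idea under made-up parameters, not the algorithm implemented in sorting_inst. The function predicted_shares and all values below are hypothetical.

```r
set.seed(7)

N <- 500; J <- 4
z       <- rnorm(N)                    # demeaned individual characteristic
x       <- c(0.2, 0.5, 0.1, 0.8)       # one non-price location characteristic
beta_p  <- -1                          # assumed mean price coefficient
beta_x  <- 0.5                         # assumed mean coefficient of x
beta_xz <- 0.3                         # assumed cross effect of z and x
supply  <- c(0.4, 0.3, 0.2, 0.1)       # share of individuals each location must absorb

# Average logit choice probability per location for a given price vector p (xi = 0)
predicted_shares <- function(p) {
  V    <- outer(rep(1, N), beta_p * p + beta_x * x) + outer(z, beta_xz * x)
  expV <- exp(V - apply(V, 1, max))
  colMeans(expV / rowSums(expV))
}

stepsize  <- 0.5
threshold <- 1e-8
max_iter  <- 1000

p <- rep(0, J)
for (iter in seq_len(max_iter)) {
  gap <- log(predicted_shares(p)) - log(supply)   # excess demand in logs
  if (max(abs(gap)) < threshold) break            # converged: markets clear
  p <- p + stepsize * gap                         # raise prices where demand exceeds supply
}

iter
round(p, 3)
```

With a small step the gap shrinks steadily. A step that is too large can push prices past the market-clearing level in every iteration, which is the overshooting described in the first note.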

Marginal willingness-to-pay values

The sorting model allows the marginal willingness to pay (MWTP) for each alternative characteristic to be calculated for each type of individual (that is, for each combination of observable individual characteristics). These MWTP values summarize how different alternative characteristics affect the location choice of heterogeneous individuals, expressed in monetary terms relative to the price of a standard house. With the full estimation results in hand, computing the MWTP is relatively simple (see the detailed explanations in Bayer et al. (2004) and Van Duijn and Rouwendal (2013)).
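As a hedged illustration, assume that price enters the utility in logs (as lnprice does in the examples above). The MWTP of an individual for characteristic \(k\) is then the ratio of the marginal utility of that characteristic to the marginal utility of log price, with the sign reversed and scaled by the house price. The sketch below applies this to invented coefficient values. None of the numbers are taken from the estimation output, and the exact definition used in the cited papers should be checked before relying on it.

```r
# Hypothetical coefficients (NOT estimates from the package output)
beta_price_0    <- -2.0   # mean coefficient of log price
beta_price_inc  <-  0.4   # cross effect: log price x income
beta_nature_0   <-  0.5   # mean coefficient of nature
beta_nature_inc <-  0.1   # cross effect: nature x income

# Three types of individuals, described by income minus the sample mean
income_dev  <- c(low = -1, mean = 0, high = 1)
house_price <- 200000     # assumed price of a standard house

# Marginal utilities for each type: beta_{k,0} + beta_{k,l} * (Z_l - Z_bar_l)
mu_price  <- beta_price_0  + beta_price_inc  * income_dev
mu_nature <- beta_nature_0 + beta_nature_inc * income_dev

# MWTP for one extra unit of nature, in the currency of house_price
mwtp_nature <- -(mu_nature / mu_price) * house_price
mwtp_nature
```

In this toy example the willingness to pay for nature rises with income, because higher-income individuals are both less price sensitive and value nature more.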

References

Bayer, P., McMillan, R., & Rueben, K. (2004). An equilibrium model of sorting in an urban housing market (No. w10865). National Bureau of Economic Research.

Berry, S., Levinsohn, J., & Pakes, A. (1995). Automobile prices in market equilibrium. Econometrica: Journal of the Econometric Society, 841-890.

van Duijn, M., & Rouwendal, J. (2013). Cultural heritage and the location choice of Dutch households in a residential sorting model. Journal of Economic Geography, 13(3), 473-500.