Thursday, July 21, 2016

Bayesian predicted slopes with interaction in multiple linear regression

Suppose we have a multiple linear regression with interaction: \[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{1\times 2} x_1 x_2 \] Notice that the slope on \(x_1\) is not just \(\beta_1\), it's \(\beta_1 + \beta_{1\times 2} x_2\): \[\hat{y} = \beta_0 + ( \beta_1 + \beta_{1\times 2} x_2) x_1 + \beta_2 x_2  \] In other words, the slope on \(x_1\) depends on the value of \(x_2\). That's the essence of interaction. Of course, the same applies analogously to the slope on \(x_2\).
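For a concrete sense of this, consider hypothetical values \(\beta_1 = 10\) and \(\beta_{1\times 2} = -0.2\) (these numbers are purely illustrative, not from the SAT example below). Then the slope on \(x_1\) is \[ \beta_1 + \beta_{1\times 2} x_2 \;=\; 10 - 0.2\, x_2 , \] which equals \(10\) when \(x_2 = 0\) but only \(6\) when \(x_2 = 20\).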

This is explained in Ch. 18 of DBDA2E, and an example is given in Figure 18.9 on p. 529 that plots the predicted slopes and their 95% HDIs as a function of the value of the other predictor. The variable "Spend" is \(x_1\) and the variable "PrcntTake" is \(x_2\). The top panel (below) plots the posterior predicted slope on \(x_1\) for values of \(x_2\):
After Figure 18.9, p. 529, of DBDA2E

A reader asked me how to compute those predicted slopes in R and make the graph. I thought I already had the R code in the package of programs, but upon searching discovered that it was not there. (Sorry; with over 700 pages of text and many dozens of programs, some things get overlooked.) This blog post supplies the R code!  :-)

First, from the folder of programs, open the file Jags-Ymet-XmetMulti-Mrobust-Example.R. Then, at the top of the file, comment out the data-loading examples and put in the following code, which reads in the relevant data and creates an interaction variable. This is explained on p. 528 of DBDA2E:

# Two predictors with interaction:
# Read the SAT data file:
myData = read.csv( file="Guber1999data.csv" )
# Append named column with interaction product:
myData = cbind( myData , SpendXPrcnt=myData[,"Spend"]*myData[,"PrcntTake"] )
# Specify the column names:
yName = "SATT" ; xName = c("Spend","PrcntTake","SpendXPrcnt")
# Put the saved files into a sub-directory:
dirName = "Guber1999data-Jags-Inter"
if( !dir.exists(dirName) ) { dir.create(dirName) }
fileNameRoot = paste0(dirName,"/Guber1999data-Jags-Inter-")
# Specify the number of saved steps and the thinning (merely to save memory):
numSavedSteps=15000 ; thinSteps=20 # smaller thinSteps will run faster
# Specify the preferred graph file format:
graphFileType = "png"
Then run the MCMC, as already specified in the script. I'm repeating it here for completeness:

# Load the relevant model into R's working memory:
source("Jags-Ymet-XmetMulti-Mrobust.R")
# Generate the MCMC chain:
mcmcCoda = genMCMC( data=myData , xName=xName , yName=yName ,
                    numSavedSteps=numSavedSteps , thinSteps=thinSteps ,
                    saveName=fileNameRoot )

The script also has commands for displaying MCMC diagnostics, numerical summary, and basic plots of the posterior distribution.
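
For reference, those commands in the example script look roughly like this (a sketch; the exact parameter list and arguments in your copy of Jags-Ymet-XmetMulti-Mrobust-Example.R may differ slightly):

# MCMC diagnostics (parameter names assumed to be the usual ones for this model):
for ( parName in c("beta0","beta[1]","beta[2]","beta[3]","sigma","nu") ) {
  diagMCMC( codaObject=mcmcCoda , parName=parName ,
            saveName=fileNameRoot , saveType=graphFileType )
}
# Numerical summary of the posterior:
summaryInfo = smryMCMC( mcmcCoda , saveName=fileNameRoot )
# Basic plots of the posterior distribution:
plotMCMC( mcmcCoda , data=myData , xName=xName , yName=yName ,
          pairsPlot=TRUE , saveName=fileNameRoot , saveType=graphFileType )
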

Now the novel R code for computing the slopes and making the graph! At the end of the script, run the following code:

# Convert mcmc coda object to a matrix:
mcmcMat = as.matrix( mcmcCoda , chains=TRUE )

# Make combs of X1 and X2 values for displaying slopes and their HDIs:
X1Comb = seq(min(myData[,xName[1]]),max(myData[,xName[1]]),length=20)
X2Comb = seq(min(myData[,xName[2]]),max(myData[,xName[2]]),length=20)

# Create place holders for X1 median slopes and HDIs:
X1SlopeMed = rep(0,length(X2Comb))
X1SlopeHDI = matrix(0,nrow=2,ncol=length(X2Comb))
# Compute the slope on X1 for each value of X2,
# as explained at top of p. 530 of DBDA2E:
for ( combIdx in 1:length(X2Comb) ) {
  credSlope = mcmcMat[,"beta[1]"] + mcmcMat[,"beta[3]"]*X2Comb[combIdx]
  X1SlopeHDI[,combIdx] = HDIofMCMC( credSlope )
  X1SlopeMed[combIdx] = median( credSlope )
}

# Same as above but for X2:
X2SlopeHDI = matrix(0,nrow=2,ncol=length(X1Comb))
X2SlopeMed = rep(0,length(X1Comb))
for ( combIdx in 1:length(X1Comb) ) {
  credSlope = mcmcMat[,"beta[2]"] + mcmcMat[,"beta[3]"]*X1Comb[combIdx]
  X2SlopeHDI[,combIdx] = HDIofMCMC( credSlope )
  X2SlopeMed[combIdx] = median( credSlope )
}

# Make the graph as in lower part of Figure 18.9, p. 529 of DBDA2E:
layout( matrix( 1:2 , nrow=1 ) )  # arrange the two panels side by side
par( mar=c(3.5,3.5,2,1) , mgp=c(2,0.7,0) )
# Panel to display slope on X1:
plot( X2Comb , X1SlopeMed , ylim=range(X1SlopeHDI) ,
      pch=19 , cex=1.5 , col="skyblue" ,
      ylab=paste("Slope on",xName[1]) , xlab=paste("Value of",xName[2]) ,
      cex.lab=1.5 )
segments( x0=X2Comb , y0=X1SlopeHDI[1,] , x1=X2Comb , y1=X1SlopeHDI[2,] ,
          col="skyblue" , lwd=3 )
# Panel to display slope on X2:
plot( X1Comb , X2SlopeMed , ylim=range(X2SlopeHDI) , pch=19 , cex=1.5 ,
      col="skyblue" ,
      ylab=paste("Slope on",xName[2]) , xlab=paste("Value of",xName[1]) ,
      cex.lab=1.5 )
segments( x0=X1Comb , y0=X2SlopeHDI[1,] , x1=X1Comb , y1=X2SlopeHDI[2,] ,
          col="skyblue" , lwd=3 )
# Save it:
saveGraph( file=paste0(fileNameRoot,"Slopes") , type=graphFileType )

Hopefully the comments in the script (above) are sufficient for you to decipher each step.

Monday, July 11, 2016

MCMC effective sample size for difference of parameters (in Bayesian posterior distribution)

We'd like the MCMC representation of a posterior distribution to have large effective sample size (ESS) for the relevant parameters. (I recommend ESS > 10,000 for reasonably stable estimates of the limits of the 95% highest density interval.) In many applications, we are interested in combinations of parameters, not just individual parameters. For example, in multi-group comparisons, we are interested in differences of the mean parameters. We might also be interested in the effect size (Cohen's d), which is the difference of mean parameters divided by the (pooled) standard deviation parameter. Unfortunately, the ESS of the combination of parameters might not be as high as the ESS's of the contributing parameters (or it might be higher!). This change in ESS is seen routinely in many results throughout DBDA2E.
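
As a concrete illustration of the computation (a minimal sketch, assuming a coda mcmc.list called mcmcCoda whose columns include hypothetical mean parameters named "mu[1]" and "mu[2]"):

library(coda)
# Combine chains into one matrix; columns are named by parameter:
mcmcMat = as.matrix( mcmcCoda )
muOne = mcmcMat[ , "mu[1]" ]
muTwo = mcmcMat[ , "mu[2]" ]
# ESS of each parameter separately:
effectiveSize( muOne )
effectiveSize( muTwo )
# ESS of the difference may be quite different from either of the above:
effectiveSize( muOne - muTwo )
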

This blog post shows a simple contrived example to demonstrate that the correlation of the contributing parameters is not necessarily linked to how much the ESS changes when the parameters are added or subtracted from each other. Consider estimating the means of two groups, with mean parameters, \(\mu_1\) and \(\mu_2\). We are interested in their difference, \(\mu_1 - \mu_2\). See the plots below for two cases in which the correlation of the mean parameters is the same, but their ESS's when combined are quite different.

In the left panel, the ESS of the difference is much less than the ESS of the contributing parameters.
In the right panel, the ESS of the difference is the same as the ESS of the contributing parameters.

P.S. This example was motivated by an exercise for one of the workshops I give.

Appendix: The R code.

# Contrived example to illustrate that the ESS of a combination of parameters
# can be different from the ESS of the individual parameters.
library(coda)  # for effectiveSize()

# First example:
nstep = 1000
x = rnorm(nstep)
xmy = x
xmy[2:nstep] = xmy[2:nstep] + 1.0 * xmy[1:(nstep-1)]  # induce autocorrelation
xpy = rnorm(length(x))
muOne = xpy - xmy
muTwo = xpy + xmy
# Plot result:
plot( muOne , muTwo , type="o" , asp=1 ,
      xlab=bquote(mu[1]) , ylab=bquote(mu[2]) ,
      main=bquote( atop(
        list( ESS[mu[1]]==.(round(effectiveSize(muOne),0)) ,
              ESS[mu[2]]==.(round(effectiveSize(muTwo),0)) ) ,
        list( ESS[mu[1]-mu[2]]==.(round(effectiveSize(muOne-muTwo),0)) ,
              ESS[mu[1]+mu[2]]==.(round(effectiveSize(muOne+muTwo),0)) )
      )) ,
      cex.lab=1.5 , cex.main=1.0 )
text( min(muOne) , min(muTwo) , adj=c(0,0) ,
      labels=bquote(r[list(mu[1],mu[2])]==.(round(cor(muOne,muTwo),2))) )

# Second example:
muOne = rnorm(nstep)
muTwo = -0.35*muOne + rnorm(nstep)
# Plot result:
plot( muOne , muTwo , type="o" , asp=1 ,
      xlab=bquote(mu[1]) , ylab=bquote(mu[2]) ,
      main=bquote( atop(
        list( ESS[mu[1]]==.(round(effectiveSize(muOne),0)) ,
              ESS[mu[2]]==.(round(effectiveSize(muTwo),0)) ) ,
        list( ESS[mu[1]-mu[2]]==.(round(effectiveSize(muOne-muTwo),0)) ,
              ESS[mu[1]+mu[2]]==.(round(effectiveSize(muOne+muTwo),0)) )
      )) ,
      cex.lab=1.5 , cex.main=1.0 )
text( min(muOne) , min(muTwo) , adj=c(0,0) ,
      labels=bquote(r[list(mu[1],mu[2])]==.(round(cor(muOne,muTwo),2))) )

Sunday, July 10, 2016

Bayesian variable selection in multiple linear regression: Model with highest R^2 is not necessarily highest posterior probability

Chapter 18 of DBDA2E includes sections on Bayesian variable selection in multiple linear regression. The idea is that each predictor (a.k.a. "variable") has an inclusion coefficient \(\delta_j\) that can be 0 or 1 (along with its regression coefficient, \(\beta_j\)). Each combination of included predictors is a different "model" of the predicted variable. The plots in DBDA2E show the posterior probabilities of combinations of included variables, but they do not show the posterior distribution of \(R^2\) for each combination. This blog post shows plots with \(R^2\) and, in turn, emphasizes that the model with the highest \(R^2\) is not necessarily the model with the highest posterior probability. That preference for a parsimonious model, not merely the best-fitting model, is a signature of the automatic penalty for complexity provided by Bayesian model comparison.
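
In equation form (a sketch using the notation above), the full model with inclusion indicators can be written as \[ \hat{y} = \beta_0 + \sum_j \delta_j \beta_j x_j , \qquad \delta_j \in \{0,1\} , \] so each distinct pattern of \(\delta_j\) values picks out a different subset of predictors, and the posterior probability of that pattern is the corresponding model's posterior probability.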

In this example, the mean Scholastic Aptitude Test (SAT) score for each state of the U.S. is the predicted variable, and the four candidate predictors are spending per student (by the state), percentage of students taking the exam, student/teacher ratio, and teacher salary. Full details are provided in Chapter 18 of DBDA2E.

Here (below) are plots of the posterior distributions for the three most probable models, and the model with all predictors included. There are four rows, corresponding to four different combinations of included variables. The posterior probability of each combination is shown at the top-left of each panel, as "Model Prob".

You can see in the second row that the model that includes only the second predictor, PrcntTake, has a posterior probability of 0.235 and an \(R^2\) of 0.784. The model in the fourth row includes all four predictors and has a higher \(R^2\) of 0.826 but a much smaller posterior probability of 0.021. (The posterior distributions of low probability models are choppy because the MCMC chain visits them rarely.)

Thanks to Dr. Renny Maduro for suggesting the inclusion of \(R^2\) in the plots. Dr. Maduro attended my recent workshop at ICPSR in Ann Arbor, Michigan.

Modified code for the graphs of this example is posted at the book's web site.

Monday, June 27, 2016

Brexit: "Bayesian" statistics renamed "Laplacian" statistics

With the U.K. leaving the E.U., it's time for "Bayesian" to exit its titular role and be replaced by "Laplacian".  


Various historians (e.g., Dale, 1999; McGrayne, 2011; as cited in DBDA2E) have argued that despite Bayes and Price having published the basic formula first, it was Laplace who did all the heavy lifting to develop the methods. Laplace was French, while Bayes and Price were English. 

On the other hand, Bayes did attend the University of Edinburgh, and lived around London, so maybe he would have voted to stay!

This humorous implication of Brexit was suggested by a participant at a workshop I presented in St. Gallen, Switzerland, last week, after I briefly described the history above and showed the book photographed at Bayes' tomb and at Laplace's memorial. I have forgotten who it was that made the joke. If the originator of the joke is reading this, please put a comment below or send me an email!

Thursday, June 23, 2016

Fixing the intercept at zero in Bayesian linear regression

In DBDA2E and in workshops, I present an example of simple linear regression: predicting an adult's weight from his/her height. A participant at a recent workshop suggested that maybe the y-intercept should be fixed at zero, because a person of zero height should have zero weight. I replied that the linear trend is really only intended as a description over the range of reasonable adult heights, not to be extrapolated all the way to a height of zero. Nevertheless, in principle it would be easy to make the restriction in the JAGS model specification. But then it was pointed out to me that the JAGS model specification in DBDA2E standardizes the variables -- to make the MCMC more efficient -- and setting the y-intercept of the standardized y to zero is (of course) not the same as setting the y-intercept of the raw scale to zero. This blog post shows how to set the y-intercept on the raw scale to zero.

For reference, here (below) is the original model specification, which does not fix the y-intercept at zero. (This is in file Jags-Ymet-Xmet-Mrobust.R.) Notice the two lines regarding the prior on the standardized intercept zbeta0 and the transformation back to the raw-scale intercept beta0:

  # Standardize the data:
  data {
    Ntotal <- length(y)
    xm <- mean(x)
    ym <- mean(y)
    xsd <- sd(x)
    ysd <- sd(y)
    for ( i in 1:length(y) ) {
      zx[i] <- ( x[i] - xm ) / xsd
      zy[i] <- ( y[i] - ym ) / ysd
    }
  }
  # Specify the model for standardized data:
  model {
    for ( i in 1:Ntotal ) {
      zy[i] ~ dt( zbeta0 + zbeta1 * zx[i] , 1/zsigma^2 , nu )
    }
    # Priors vague on standardized scale:
    zbeta0 ~ dnorm( 0 , 1/(10)^2 )
    zbeta1 ~ dnorm( 0 , 1/(10)^2 )
    zsigma ~ dunif( 1.0E-3 , 1.0E+3 )
    nu ~ dexp(1/30.0)
    # Transform to original scale:
    beta1 <- zbeta1 * ysd / xsd
    beta0 <- zbeta0 * ysd  + ym - zbeta1 * xm * ysd / xsd
    sigma <- zsigma * ysd
  }

Now, how to fix beta0 at zero: instead of putting a broad prior distribution on zbeta0, set zbeta0 to whatever value makes the raw-scale beta0 equal zero. To determine that value, take the transform-to-raw-scale formula for beta0 shown at the end of the model specification above, set beta0 to 0, and solve for zbeta0. Then replace the prior on zbeta0 as follows:
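
Explicitly, using the same symbols as the code: \[ \mathtt{beta0} \;=\; \mathtt{zbeta0}\cdot \mathtt{ysd} + \mathtt{ym} - \mathtt{zbeta1}\cdot \mathtt{xm}\cdot \mathtt{ysd}/\mathtt{xsd} \;=\; 0 \quad\Longrightarrow\quad \mathtt{zbeta0} \;=\; \mathtt{zbeta1}\cdot \mathtt{xm}/\mathtt{xsd} - \mathtt{ym}/\mathtt{ysd} , \] which is exactly the deterministic assignment used below.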

    # zbeta0 ~ dnorm( 0 , 1/(10)^2 ) 
    # Fix y intercept at zero:
    zbeta0 <- zbeta1*xm/xsd - ym/ysd

When run on the height/weight data (using script Jags-Ymet-Xmet-Mrobust-Example.R), a smattering of regression lines from the posterior distribution looks like this:

You can see that the y-intercept has indeed been fixed at zero for all the regression lines.

One crucial trick for getting the graphs to display: in the plotMCMC command of Jags-Ymet-Xmet-Mrobust-Example.R, set showCurve to TRUE:

plotMCMC( mcmcCoda , data=myData , xName=xName , yName=yName ,
          compValBeta1=0.0 , ropeBeta1=c(-0.5,0.5) ,
          pairsPlot=TRUE , showCurve=TRUE ,
          saveName=fileNameRoot , saveType=graphFileType )

If showCurve=FALSE, then it tries to make a histogram, which fails because the MCMC values of beta0 are now all (essentially) exactly zero.

Monday, May 30, 2016

Some Bayesian approaches to replication analysis and planning (talk video)

This talk was presented at the Association for Psychological Science convention on Friday, May 27, 2016. The video (below) is a pre-recording made at my desk; the actual talk included some spontaneous additions about relations of the material to previous speakers' talks, and a few attempts at you-had-to-be-there humor.
If you have comments or questions about the talk, please post them with the video on YouTube, not here.

Here's a snapshot of the speakers:
Left to right: Joe Rodgers, John Kruschke, Larry Hedges, Pat Shrout (symposium organizer), and Scott Maxwell.

Sunday, May 15, 2016

Bayesian inference in the (abnormal) mind

The (abnormal) mind can be modeled as a Bayesian inference engine, as summarized in the post, Bayesian reasoning implicated in some mental disorders. Excerpt:
“The brain is a guessing machine [i.e., Bayesian inference engine - JKK], trying at each moment of time to guess what is out there,” says computational neuroscientist Peggy Seriès. Guesses just slightly off — like mistaking a smile for a smirk — rarely cause harm. But guessing gone seriously awry may play a part in mental illnesses such as schizophrenia, autism and even anxiety disorders, Seriès and other neuroscientists suspect. They say that a mathematical expression known as Bayes’ theorem — which quantifies how prior expectations can be combined with current evidence — may provide novel insights into pernicious mental problems that have so far defied explanation.
For a tutorial about Bayesian models of perception and cognition, see Bayesian learning theory applied to human cognition.

Note that Bayesian modeling of data is a richly valuable approach regardless of whether any particular Bayesian model of mind is accurate. See this brief blog post for the distinction between (Bayesian) descriptive models of data, psychometric models, and models of mind.