In linear regression, why should we include quadratic terms when we are only interested in interaction terms?























Suppose I am interested in the linear regression model $$Y_i = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2,$$ because I would like to see whether an interaction between the two covariates has an effect on $Y$.

In a professor's course notes (I am not in contact with the professor), it states:
When including interaction terms, you should include their second-degree terms, i.e., $$Y_i = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + \beta_4x_1^2 + \beta_5x_2^2$$ should be the model fitted in the regression.

Why should one include second-degree terms when we are only interested in the interactions?






























  • 7




    If model has $x_1x_2$, it should include $x_1$ and $x_2$. But $x_1^2$ and $x_2^2$ are optional.
    – user158565
    Dec 1 at 23:30






  • 5




    Your professor's opinion seems to be unusual. It might stem from a specialized background or set of experiences, because "should" is definitely not a universal requirement. You might find stats.stackexchange.com/questions/11009 to be of some interest.
    – whuber
    Dec 1 at 23:32












  • @user158565 hi! May I ask why we should also include $x_1$ and $x_2$? I did not originally think of that, but now that you mentioned it..!
    – Kevin C
    Dec 1 at 23:34












  • @whuber hi! Thanks for the link! I think including the main effect makes sense, but I have trouble extending that to having to include second order terms. // user158565 I think the link above answered that, thank you!
    – Kevin C
    Dec 1 at 23:45










  • Would you please post a link to the data?
    – James Phillips
    Dec 2 at 0:46



































regression multiple-regression interaction linear-model














edited Dec 3 at 5:42

























asked Dec 1 at 23:25









Kevin C



















2 Answers






























It depends on the goal of inference. If you want to infer whether an interaction exists, for instance in a causal context (or, more generally, if you want to interpret the interaction coefficient), your professor's recommendation does make sense: misspecification of the functional form can lead to wrong inferences about the interaction.

Here is a simple example where there is no interaction term between $x_1$ and $x_2$ in the structural equation of $y$. Yet if you do not include the quadratic term of $x_1$, you would wrongly conclude that $x_1$ interacts with $x_2$.



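set.seed(10)
n <- 1e3
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)               # x2 is correlated with x1
y <- x1 + x2 + x1^2 + rnorm(n)    # true model: quadratic in x1, no interaction
summary(lm(y ~ x1 + x2 + x1:x2))  # misspecified fit: omits x1^2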

Call:
lm(formula = y ~ x1 + x2 + x1:x2)

Residuals:
    Min      1Q  Median      3Q     Max
-3.7781 -0.8326 -0.0806  0.7598  7.7929

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.30116    0.04813   6.257 5.81e-10 ***
x1           1.03142    0.05888  17.519  < 2e-16 ***
x2           1.01806    0.03971  25.638  < 2e-16 ***
x1:x2        0.63939    0.02390  26.757  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.308 on 996 degrees of freedom
Multiple R-squared: 0.7935, Adjusted R-squared: 0.7929
F-statistic: 1276 on 3 and 996 DF, p-value: < 2.2e-16


This can be interpreted as simply a case of omitted variable bias, and here $x_1^2$ is the omitted variable. If you go back and include the squared term in your regression, the apparent interaction disappears.



summary(lm(y ~ x1 + x2 + x1:x2 + I(x1^2)))

Call:
lm(formula = y ~ x1 + x2 + x1:x2 + I(x1^2))

Residuals:
    Min      1Q  Median      3Q     Max
-3.4574 -0.7073  0.0228  0.6723  3.7135

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0419958  0.0398423  -1.054    0.292
x1           1.0296642  0.0458586  22.453   <2e-16 ***
x2           1.0017625  0.0309367  32.381   <2e-16 ***
I(x1^2)      1.0196002  0.0400940  25.430   <2e-16 ***
x1:x2       -0.0006889  0.0313045  -0.022    0.982
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.019 on 995 degrees of freedom
Multiple R-squared: 0.8748, Adjusted R-squared: 0.8743
F-statistic: 1739 on 4 and 995 DF, p-value: < 2.2e-16
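
The size of the spurious coefficient in the first fit can be worked out from the simulation design. The following is a sketch under the data-generating process above, writing $x_2 = x_1 + e$ with $x_1$ and $e$ independent standard normal (the symbol $e$ is notation introduced here, not part of the original code). Since $x_1$ and $x_2$ are odd functions of the Gaussian inputs while $x_1^2$ and $x_1x_2$ are even, the omitted term $x_1^2$ is uncorrelated with $x_1$ and $x_2$, and so is $x_1x_2$. The linear projection of $x_1^2$ on the included regressors therefore loads only on the intercept and on $x_1x_2$, with slope

$$\frac{\operatorname{Cov}(x_1^2,\, x_1x_2)}{\operatorname{Var}(x_1x_2)} = \frac{E[x_1^3x_2] - E[x_1^2]\,E[x_1x_2]}{E[x_1^2x_2^2] - \left(E[x_1x_2]\right)^2} = \frac{3 - 1}{4 - 1} = \frac{2}{3},$$

using $E[x_1^3x_2] = E[x_1^4] = 3$, $E[x_1x_2] = E[x_1^2] = 1$, and $E[x_1^2x_2^2] = E[x_1^4] + E[x_1^2]\,E[e^2] = 4$. Because the true coefficient of $x_1^2$ is 1 and the true interaction is 0, the misspecified fit's interaction coefficient tends to $0 + 1 \cdot \tfrac{2}{3} \approx 0.67$ and its intercept to $1 - \tfrac{2}{3} = \tfrac{1}{3}$, which is close to the estimates 0.639 and 0.301 reported in the first output.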


Of course, this reasoning applies not only to quadratic terms but to misspecification of the functional form in general. The goal is to model the conditional expectation function well enough to assess the interaction. If you restrict yourself to linear regression, you will need to include these nonlinear terms manually. An alternative is a more flexible regression method, such as kernel ridge regression.
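
A single seed could in principle be a fluke, so one way to check the point is to repeat the simulation many times and count how often each specification flags the interaction. The following is a minimal sketch in base R, not part of the original answer; the number of replications (`n_sims`), the sample size (`n`), and the 5% threshold are arbitrary choices made here for illustration.

```r
# Monte Carlo sketch: how often is x1:x2 declared significant at the 5% level?
# n_sims, n and the 0.05 cutoff are illustrative assumptions.
set.seed(1)
n_sims <- 200
n <- 500

reject <- replicate(n_sims, {
  # Same data-generating process as in the example: no true interaction.
  x1 <- rnorm(n)
  x2 <- x1 + rnorm(n)
  y  <- x1 + x2 + x1^2 + rnorm(n)

  # p-value of the interaction term without and with the quadratic term.
  p_short <- coef(summary(lm(y ~ x1 + x2 + x1:x2)))["x1:x2", "Pr(>|t|)"]
  p_long  <- coef(summary(lm(y ~ x1 + x2 + x1:x2 + I(x1^2))))["x1:x2", "Pr(>|t|)"]

  c(short = p_short < 0.05, long = p_long < 0.05)
})

# Share of replications in which each model rejects "no interaction".
rejection_rate <- rowMeans(reject)
print(rejection_rate)
```

Under this design one would expect the rate for the model without $x_1^2$ to be close to 1 and the rate for the model with $x_1^2$ to be near the nominal 0.05, since the true interaction is zero.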





























  • Thank you @CarlosCinelli , in conclusion, are you saying we should include terms of same degree - to account for potential misspecification of the functional form - and let the regression determine which terms are significant?
    – Kevin C
    Dec 2 at 19:56






  • 3




    @KevinC the main question here is: do you want to interpret the interaction term? If you do, then misspecification of the functional form is a real issue. Adding quadratic terms is just one simple way of capturing non-linearities, but the general issue is modeling the conditional expectation function appropriately.
    – Carlos Cinelli
    Dec 2 at 20:17








  • 1




    Please do not include rm(list=ls()) in code posted here! If people just copy&paste and run the code, they could get a surprise ... I removed it for now.
    – kjetil b halvorsen
    Dec 2 at 22:34































The two models you listed in your question can be re-expressed to make clear how the effect of $X_1$ is postulated to depend on $X_2$ (or the other way around) in each model.



The first model can be re-expressed like this:



$$Y = \beta_0 + (\beta_1 + \beta_3X_2)X_1 + \beta_2X_2 + \epsilon,$$

which shows that, in this model, $X_1$ is assumed to have a linear effect on $Y$ (controlling for the effect of $X_2$), but the magnitude of this linear effect, captured by the slope coefficient of $X_1$, changes linearly as a function of $X_2$. For example, the effect of $X_1$ on $Y$ may increase in magnitude as the values of $X_2$ increase.



The second model can be re-expressed like this:



$$Y = \beta_0 + (\beta_1 + \beta_3X_2)X_1 + \beta_4X_1^2 + \beta_2X_2 + \beta_5X_2^2 + \epsilon,$$



which shows that, in this model, the effect of $X_1$ on $Y$ (controlling for the effect of $X_2$) is assumed to be quadratic rather than linear. This quadratic effect is captured by including both $X_1$ and $X_1^2$ in the model. While the coefficient of $X_1^2$ is assumed to be independent of $X_2$, the coefficient of $X_1$ is assumed to depend linearly on $X_2$.
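
To make the contrast concrete, one can differentiate the conditional mean of each model with respect to $X_1$. This is a short worked step using only the two equations above, and it assumes the error term has conditional mean zero given $X_1$ and $X_2$:

$$\frac{\partial E[Y \mid X_1, X_2]}{\partial X_1} = \beta_1 + \beta_3X_2 \quad \text{(first model)},$$

$$\frac{\partial E[Y \mid X_1, X_2]}{\partial X_1} = \beta_1 + \beta_3X_2 + 2\beta_4X_1 \quad \text{(second model)}.$$

In the first model the slope in $X_1$ varies only with $X_2$. In the second it also varies with $X_1$ itself through $2\beta_4X_1$, while $\beta_3$ still measures how much that slope shifts per unit increase in $X_2$.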



Choosing one model over the other means making different assumptions about the nature of the effect of $X_1$ on $Y$ (controlling for the effect of $X_2$).

Usually, people fit the first model. They might then plot the residuals from that model against $X_1$ and $X_2$ in turn. If the residuals show a quadratic pattern as a function of $X_1$ and/or $X_2$, the model can be augmented so that it includes $X_1^2$ and/or $X_2^2$ (and possibly their interaction).
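
The workflow just described can be sketched in base R as follows. This is an illustration only: the simulated data (quadratic in `x1`, no true interaction) and the object names `fit1` and `fit2` are assumptions made here, not something taken from the question, and the nested-model F-test at the end is one possible way to complement the visual check.

```r
# Illustrative data (an assumption for this sketch): quadratic in x1, no interaction.
set.seed(10)
n  <- 1e3
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)
y  <- x1 + x2 + x1^2 + rnorm(n)

# Step 1: fit the interaction-only model.
fit1 <- lm(y ~ x1 + x2 + x1:x2)

# Step 2: plot the residuals against each predictor in turn.
# The lowess curve makes systematic curvature easier to see.
op <- par(mfrow = c(1, 2))
plot(x1, resid(fit1), xlab = "x1", ylab = "residuals of fit1")
lines(lowess(x1, resid(fit1)), col = "red")
plot(x2, resid(fit1), xlab = "x2", ylab = "residuals of fit1")
lines(lowess(x2, resid(fit1)), col = "red")
par(op)

# Step 3: augment the model with squared terms and compare the nested fits.
fit2 <- update(fit1, . ~ . + I(x1^2) + I(x2^2))
print(anova(fit1, fit2))
```

With real data there is no known truth to compare against, so the residual plots and the comparison of nested models are the evidence for or against adding the squared terms.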



Note that I simplified the notation you used for consistency and also made the error term explicit in both models.

























  • 2




    Hi @IsabellaGhement , thank you for your explanation. In summary, there are really no "rules" in that we should add quadratic terms if we include interaction terms. At the end of the day, it comes back to the assumptions we are making about our model, and the results of our analysis (ie. residual plots). Is this correct? Thanks again :)!
    – Kevin C
    Dec 2 at 20:13






  • 2




    That's right, Kevin! There are no "rules", because each data set is different and is also meant to answer different questions. That is why it is important for us to be aware that each model we fit to that data set implies different assumptions, which need to be supported by the data for us to trust the model results. The model diagnostic plots (e.g., plot of residuals vs. fitted values) help us verify to what extent - if any - the data support the model assumptions.
    – Isabella Ghement
    Dec 2 at 21:44








  • 1




    Understood, thank you Isabella. Happy holidays!
    – Kevin C
    Dec 8 at 7:50






  • 1




    @KevinC: Great! Happy holidays to you too, Kevin! ☃🎉🎁🎈
    – Isabella Ghement
    Dec 8 at 14:52












































edited Dec 2 at 22:35









kjetil b halvorsen











answered Dec 2 at 1:15









Carlos Cinelli













  • Thank you @CarlosCinelli , in conclusion, are you saying we should include terms of same degree - to account for potential misspecification of the functional form - and let the regression determine which terms are significant?
    – Kevin C
    Dec 2 at 19:56






  • 3




    @KevinC the main question here is: do you want to interpret the interaction term? If you do, then misspecification of the functional form is a real issue. Adding quadratic terms is just one simple way of capturing non-linearities, but the general issue is modeling the conditional expectation function appropriately.
    – Carlos Cinelli
    Dec 2 at 20:17








  • 1




    Please do not include rm(list=ls()) in code posted here! If people just copy&paste and run the code, they could get a surprise ... I removed it for now.
    – kjetil b halvorsen
    Dec 2 at 22:34



















up vote
3
down vote













The two models you listed in your question can be re-expressed to make it clear how the effect of $X_1$ is postulated to depend on $X_2$ (or the other way around) in each model.



The first model can be re-expressed like this:



$$Y = \beta_0 + (\beta_1 + \beta_3X_2)X_1 + \beta_2X_2 + \epsilon,$$



which shows that, in this model, $X_1$ is assumed to have a linear effect on $Y$ (controlling for the effect of $X_2$), but the magnitude of this linear effect, captured by the slope coefficient of $X_1$, changes linearly as a function of $X_2$. For example, the effect of $X_1$ on $Y$ may increase in magnitude as the values of $X_2$ increase.
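One way to read this re-expression is to differentiate the mean of $Y$ with respect to $X_1$ (a short worked step, assuming $\epsilon$ has mean zero given $X_1$ and $X_2$):

$$\frac{\partial E[Y \mid X_1, X_2]}{\partial X_1} = \beta_1 + \beta_3X_2,$$

so the slope in $X_1$ does not depend on $X_1$ itself and shifts by $\beta_3$ for each one-unit increase in $X_2$.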



The second model can be re-expressed like this:



$$Y = \beta_0 + (\beta_1 + \beta_3X_2)X_1 + \beta_4X_1^2 + \beta_2X_2 + \beta_5X_2^2 + \epsilon,$$



which shows that, in this model, the effect of $X_1$ on $Y$ (controlling for the effect of $X_2$) is assumed to be quadratic rather than linear. This quadratic effect is captured by including both $X_1$ and $X_1^2$ in the model. While the coefficient of $X_1^2$ is assumed to be independent of $X_2$, the coefficient of $X_1$ is assumed to depend linearly on $X_2$.
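Differentiating the mean of this second model with respect to $X_1$ (a short worked step, assuming $\epsilon$ has mean zero given $X_1$ and $X_2$) gives

$$\frac{\partial E[Y \mid X_1, X_2]}{\partial X_1} = \beta_1 + \beta_3X_2 + 2\beta_4X_1,$$

so the slope in $X_1$ now varies with $X_1$ itself through the $2\beta_4X_1$ term, while $X_2$ still only shifts it by $\beta_3$ per unit.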



Using one model rather than the other implies different assumptions about the nature of the effect of $X_1$ on $Y$ (controlling for the effect of $X_2$).



Usually, people fit the first model. They might then plot the residuals from that model against $X_1$ and $X_2$ in turn. If the residuals reveal a quadratic pattern as a function of $X_1$ and/or $X_2$, the model can be augmented accordingly so that it includes $X_1^2$ and/or $X_2^2$ (and possibly their interaction).
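The workflow just described could be sketched in R roughly as follows. This is only an illustration on simulated data: the seed, sample size, coefficients, and object names (`fit1`, `fit2`, `res`) are assumptions, and the data are generated with curvature in `x1` so that the residual plot has something to show.

```r
# Hypothetical data with a quadratic effect of x1 (values are assumptions).
set.seed(42)
n  <- 500
x1 <- runif(n, -2, 2)
x2 <- runif(n, -2, 2)
y  <- 1 + x1 + x2 + 0.5 * x1 * x2 + x1^2 + rnorm(n)

# Step 1: fit the first model (main effects plus interaction).
fit1 <- lm(y ~ x1 * x2)
res  <- resid(fit1)

# Step 2: plot residuals against each covariate in turn, with a smoother.
op <- par(mfrow = c(1, 2))
plot(x1, res, main = "Residuals vs x1"); lines(lowess(x1, res), col = "red")
plot(x2, res, main = "Residuals vs x2"); lines(lowess(x2, res), col = "red")
par(op)

# Step 3: if a quadratic pattern shows up against x1, augment the model.
fit2 <- update(fit1, . ~ . + I(x1^2))
anova(fit1, fit2)   # F-test comparing the nested models
```

The plots are the primary diagnostic here; the `anova()` comparison is an optional, more formal check that the added term improves the fit.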



Note that I simplified the notation you used for consistency and also made the error term explicit in both models.







edited Dec 2 at 1:21

answered Dec 2 at 1:16
Isabella Ghement

  • 2
    Hi @IsabellaGhement, thank you for your explanation. In summary, there are really no "rules" in that we should add quadratic terms if we include interaction terms. At the end of the day, it comes back to the assumptions we are making about our model, and the results of our analysis (i.e., residual plots). Is this correct? Thanks again :)!
    – Kevin C
    Dec 2 at 20:13

  • 2
    That's right, Kevin! There are no "rules", because each data set is different and is also meant to answer different questions. That is why it is important for us to be aware that each model we fit to that data set implies different assumptions, which need to be supported by the data for us to trust the model results. The model diagnostic plots (e.g., plot of residuals vs. fitted values) help us verify to what extent - if any - the data support the model assumptions.
    – Isabella Ghement
    Dec 2 at 21:44

  • 1
    Understood, thank you Isabella. Happy holidays!
    – Kevin C
    Dec 8 at 7:50

  • 1
    @KevinC: Great! Happy holidays to you too, Kevin! ☃🎉🎁🎈
    – Isabella Ghement
    Dec 8 at 14:52













