# Level 1 CFA® Exam:

ANOVA & Hypothesis Testing

In this lesson we're going to continue our discussion of linear regression model analysis. Statistical methods and econometrics find wide application in capital markets, so thorough knowledge of quantitative methods used for valuation is extremely important.

As financial analysts, you are likely to need such skills as: testing hypotheses concerning variables, predicting the values of variables, and using ANOVA tables.

Let's go through these and look at examples.

Let's start with hypothesis testing on the regression coefficients. The most common way to test the significance of a slope coefficient, or to check how much it differs from a selected value, is the Student's t-test with \(n-2\) degrees of freedom:

\(t = \frac{\hat{b}_{1} - B_{1}}{s_{\hat{b}_{1}}}\)

- \(t\) - t-statistic
- \(\hat{b}_{1}\) - estimated value of slope coefficient
- \(B_{1}\) - hypothesized population slope coefficient
- \(s_{\hat{b}_{1}}\) - standard error of the slope coefficient

A decision on whether or not to reject the hypothesis that a slope coefficient is equal to a given value is made based on the following test:

Reject the null if the t-statistic is greater than the positive critical value or lower than the negative critical value.
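As a minimal sketch, this decision rule can be coded directly. All numbers below are hypothetical, and the critical value is hard-coded from a standard t-table rather than computed:

```python
# Hypothetical inputs: test H0: b1 = 0 at the 5% significance level.
b1_hat = 0.50   # estimated slope coefficient
B1 = 0.0        # hypothesized population slope
se_b1 = 0.18    # standard error of the slope coefficient
n = 30          # observations -> n - 2 = 28 degrees of freedom

t_stat = (b1_hat - B1) / se_b1   # = 2.78 (rounded)

# Two-tailed critical value for df = 28 at the 5% level, from a t-table.
t_crit = 2.048

reject_null = abs(t_stat) > t_crit
print(round(t_stat, 2), reject_null)   # 2.78 True
```

With \(|t| = 2.78 > 2.048\), the hypothetical slope would be judged statistically significant.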

Assume that the estimated value of the slope coefficient is 0.68 and the standard error of the slope coefficient is 0.2. Let’s test whether the slope coefficient is significantly different from 1 at the 90% confidence level, with 25 observations.

(...)

Assume that the estimated value of the slope coefficient is 0.6818 and the standard error of the slope coefficient is 0.2. Let’s test whether the slope coefficient is significantly different from 0 at the 95% confidence level, with 25 observations.

(...)

Let's now move on to something that will enhance your inventory of analytical tools.

One great tool for explaining the variation of the dependent variable is a procedure called analysis of variance, ANOVA for short.

Analysis of variance (ANOVA) divides the total variability of a variable into different sources.

How do we, as analysts, use this procedure? It finds its application in determining how useful an independent variable is (or a group of independent variables are) for explaining the variation in the dependent variable.

With that in mind, let’s have a look at a couple of formulas that you, as a CFA Candidate, need to know to be able to use ANOVA tables.

We know from our previous lesson that total variation equals explained variation plus unexplained variation:

\(SST=SSR+SSE\)

- \(SST\) - sum of squares total (total variation in dependent variable)
- \(SSR\) - sum of squares regression (explained variation in dependent variable)
- \(SSE\) - sum of squares error (unexplained variation in dependent variable)

Sum of squares total (SST) measures the total variation in the dependent variable. SST is computed as the sum of the squared differences between each real value of the dependent variable and the mean value of the dependent variable:

\(SST=\sum_{i=1}^{n}(Y_{i}-\bar{Y})^2\)

- \(SST\) - sum of squares total
- \(Y_{i}\) - i-th observation of dependent variable Y
- \(\bar{Y}\) - mean of variable Y
- \(n\) - number of observations

Sum of squares regression (SSR) measures the variation in the dependent variable explained by the variation in the independent variable. SSR is computed as the sum of the squared differences between the predicted values of the dependent variable and the mean value of the dependent variable:

\(SSR=\sum_{i=1}^{n}(\hat{Y}_{i}-\bar{Y})^2\)

- \(SSR\) - sum of squares regression
- \(\hat{Y}_{i}\) - estimated value of i-th observation of dependent variable Y
- \(\bar{Y}\) - mean of variable Y
- \(n\) - number of observations

Sum of Squares Error (SSE) measures the unexplained variation in the dependent variable, that is the variation that hasn't been explained by the variation in the independent variable. It's the sum of the squared differences between the real value of the dependent variable and the predicted value of the dependent variable:

\(SSE=\sum_{i=1}^{n}(Y_{i}-\hat{Y}_{i})^2\)

- \(SSE\) - sum of squares error
- \(Y_{i}\) - i-th observation of dependent variable Y
- \(\hat{Y}_{i}\) - estimated value of i-th observation of dependent variable
- \(n\) - number of observations
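The identity \(SST=SSR+SSE\) can be checked on any data set. Below is a minimal sketch with made-up data; the OLS formulas for the slope and intercept are the standard ones from the previous lesson:

```python
# Made-up data for illustration only.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# OLS estimates for the simple regression Y = b0 + b1*X.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / \
     sum((x - x_bar) ** 2 for x in X)
b0 = y_bar - b1 * x_bar
Y_hat = [b0 + b1 * x for x in X]

SST = sum((y - y_bar) ** 2 for y in Y)               # total variation
SSR = sum((yh - y_bar) ** 2 for yh in Y_hat)         # explained variation
SSE = sum((y - yh) ** 2 for y, yh in zip(Y, Y_hat))  # unexplained variation

# The decomposition holds up to floating-point rounding.
assert abs(SST - (SSR + SSE)) < 1e-9
```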

Another essential tool is the F-statistic. We use the F-statistic to determine how well a group of independent variables explains the variation in the dependent variable. The F-statistic measures whether we can explain a significant part of the variance in the dependent variable using at least one independent variable.

In a simple regression model with one independent variable, the F-statistic is given by this formula:

\(F = \frac{\frac{SSR}{1}}{\frac{SSE}{n-2}} = \frac{MSR}{MSE}\)

- \(F\) - F-statistic
- \(SSR\) - sum of squares regression
- \(SSE\) - sum of squares error
- \(n\) - total number of observations
- \(MSR\) - mean square regression
- \(MSE\) - mean square error

If you recall the formula for the t-test we used to test the significance of the slope coefficient, you will notice that the F-statistic here equals the squared t-statistic. Thus, in our simple regression model with one independent variable, we will usually use the t-test rather than the F-test.
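This equivalence is easy to verify numerically. The sketch below, using made-up data, fits a simple regression by hand and checks that the F-statistic equals the squared t-statistic for \(H_0: b_1=0\):

```python
import math

# Made-up data for illustration only.
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [1.2, 2.1, 2.8, 4.1, 5.2, 5.9]
n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

# OLS fit of Y = b0 + b1*X.
S_xx = sum((x - x_bar) ** 2 for x in X)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / S_xx
b0 = y_bar - b1 * x_bar

# ANOVA quantities and the F-statistic.
SSR = sum((b0 + b1 * x - y_bar) ** 2 for x in X)
SSE = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
MSR, MSE = SSR / 1, SSE / (n - 2)
F = MSR / MSE

# t-statistic for H0: b1 = 0, using SEE and the standard error of b1.
SEE = math.sqrt(MSE)
se_b1 = SEE / math.sqrt(S_xx)
t = b1 / se_b1

# In simple regression, F = t^2 (up to floating-point rounding).
assert abs(F - t ** 2) < 1e-6 * F
```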

However, in the case of multiple regression models with several independent variables, this test is very useful.

The F-test is always one-sided. We read the critical value from the F-table for the given significance level, with \(1\) degree of freedom in the numerator (the table column) and \(n-2\) degrees of freedom in the denominator (the table row).

For a simple linear regression model (with one independent variable), the F-test will be based on the following hypotheses:

\(H_0: \ b_1=0\)

\(H_A: \ b_1 \neq 0\)

As you can see, they're the same hypotheses as in the Student's t-test. The F-statistic can be compared to the critical value of the F-test read from the table with \(1\) and \(n-2\) degrees of freedom.

When the F-statistic is greater than the F-test critical value, the null hypothesis is rejected in favor of the alternative hypothesis, which states that the independent variable is statistically significant (the slope is different from 0). Statistical significance of an independent variable means that the variable significantly explains the variation in the dependent variable.

Have a look at an ANOVA table which summarizes the variation in the dependent variable. ANOVA has many applications and it's a part of many statistical packages. It's perfect for analysing the variation of the explained variable, identifying the source of variation and the proportion of variation that is not directly explained by the variation in the independent variable. An ANOVA table can usually be used as a source of data for computing measures of variation that interest us. So, have a look at an ANOVA table for simple regression (one with a single independent variable).

(...)

The coefficient of determination (\(R^2\)) and the standard error of estimate (SEE) can be calculated directly from an ANOVA table using the following formulas:

\(R^2=\frac{\Sigma^{n}_{i=1} (\hat{Y}_{i} - \bar{Y})^{2}}{\Sigma^{n}_{i=1} (Y_{i} - \bar{Y})^{2}}=\\=\frac{SSR}{SST}=\frac{SST-SSE}{SST}=1-\frac{SSE}{SST}\)

- \(\hat{Y}_{i}\) - estimated value of i-th observation of dependent variable
- \(\bar{Y}\) - mean of variable Y
- \(n\) - number of observations in the sample
- \(SST\) - sum of squares total (total variation)
- \(SSR\) - sum of squares regression (explained variation)
- \(SSE\) - sum of squares error (unexplained variation)

\(SEE =\sqrt{MSE}=\sqrt{\frac{SSE}{n-2}}=\\= \sqrt{\frac{\Sigma^{n}_{i=1} (Y_{i} - \hat{Y}_{i})^{2}}{n-2}} = \sqrt{\frac{\Sigma^{n}_{i=1} (Y_{i} - \hat{b}_{0} - \hat{b}_{1}\times X_{i})^{2}}{n-2}} = \sqrt{\frac{\Sigma^{n}_{i=1} (\hat{\varepsilon}_{i})^{2}}{n-2}}\)

- \(SEE\) - standard error of estimate
- \(MSE\) - mean square error
- \(SSE\) - sum of squares error
- \(Y\) - dependent variable
- \(X\) - independent variable
- \(\hat{b}_{0}\) - estimated value of intercept
- \(\hat{b}_{1}\) - estimated value of slope coefficient
- \(\hat{\varepsilon}_{i}\) - i-th estimated residual (error term)
- \(n\) - number of observations in the sample
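Both measures reduce to simple arithmetic on the ANOVA sums of squares. A minimal sketch with made-up numbers:

```python
import math

# Made-up ANOVA quantities for illustration.
SST = 0.08   # total variation
SSE = 0.02   # unexplained variation
n = 25       # observations -> n - 2 = 23 degrees of freedom

SSR = SST - SSE           # explained variation
R2 = SSR / SST            # equivalently 1 - SSE / SST
SEE = math.sqrt(SSE / (n - 2))

print(round(R2, 2), round(SEE, 4))   # 0.75 0.0295
```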

Let's examine an example illustrating the application of an ANOVA table. What matters most here is the interpretation of variance analysis results.

Complete the ANOVA table for the regression model for growth in ABC's net income. Compute the coefficient of determination and the standard error of estimate. Assume that there are 25 observations.

| Source of variation | Degrees of freedom (df) | Sum of squares | Mean sum of squares |
|---|---|---|---|
| Regression (explained variation) | | \(0.06\) | |
| Error (unexplained variation) | | \(0.009\) | |
| Total | | | |

(...)

Now, let's take a look at an example of how to use the F-statistic.

Based on the data in the ANOVA table below, use the F-statistic to test the hypothesis that the slope coefficient of the independent variable is significantly different from zero at a significance level of 0.5%.

| Source of variation | Degrees of freedom (df) | Sum of squares | Mean sum of squares |
|---|---|---|---|
| Regression (explained variation) | \(k=1\) | \(SSR=0.070532\) | \(MSR=SSR=0.070532\) |
| Error (unexplained variation) | \(n-2=25\) | \(SSE=0.01\) | \(MSE=\frac{SSE}{n-2}=\frac{0.01}{25}=0.0004\) |
| Total | \(n-1=26\) | \(SST=0.070932\) | |

(...)
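As a quick sanity check of the numbers in the table (not a substitute for the worked solution), the implied F-statistic can be computed directly:

```python
# Values taken from the ANOVA table above (n = 27 observations).
SSR = 0.070532
SSE = 0.01
n = 27

MSR = SSR / 1          # one independent variable, so the numerator df = 1
MSE = SSE / (n - 2)    # 0.01 / 25 = 0.0004
F = MSR / MSE

print(round(F, 2))     # 176.33
```

This value is then compared with the one-sided critical value of the F-distribution with 1 and 25 degrees of freedom at the 0.5% significance level.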

Financial analysts often use regression models to predict the future values of a dependent variable. This is done using prediction intervals. Confidence intervals for the predicted values of the dependent variable are analogous to confidence intervals calculated for the slope coefficient, with one difference: instead of the standard error of estimate, we use the standard error of the forecast (\(s_f\)). The confidence interval extends from:

\(\hat{Y}_f-t_c\times{s_f}\) to \(\hat{Y}_f+t_c\times{s_f}\)

The estimated variance of the prediction error (\({s^2}_f\)) is given by the following formula:

\(s^{2}_{f} = SEE^{2}\times [1 + \frac{1}{n} + \frac{(X_f - \overline{X})^{2}}{(n-1)\times s^{2}_{X}}]=\\=SEE^{2}\times [1 + \frac{1}{n} + \frac{(X_f - \overline{X})^{2}}{\Sigma^{n}_{i=1} (X_{i} - \overline{X})^{2}}]\)

- \(s^{2}_{f}\) - estimated variance of the prediction error
- \(SEE^{2}\) - squared standard error of estimate
- \(n\) - number of observations
- \(X_i\) - i-th observation of the independent variable
- \(X_f\) - forecasted value of the independent variable
- \(\overline{X}\) - mean of the independent variable
- \(s^{2}_{X}\) - variance of the independent variable

The standard error of the forecast (\(s_f\)), which is the square root of the estimated variance of the prediction error, is given by the following formula:

\(s_{f} = SEE\sqrt{1 + \frac{1}{n} + \frac{(X_f - \overline{X})^{2}}{(n-1)\times s^{2}_{X}}}=\\=SEE\sqrt{1 + \frac{1}{n} + \frac{(X_f - \overline{X})^{2}}{\Sigma^{n}_{i=1} (X_{i} - \overline{X})^{2}}}\)

- \(s_{f}\) - standard error of the forecast
- \(SEE\) - standard error of estimate
- \(n\) - number of observations
- \(X_i\) - i-th observation of the independent variable
- \(X_f\) - forecasted value of the independent variable
- \(\overline{X}\) - mean of the independent variable
- \(s^{2}_{X}\) - variance of the independent variable
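Putting the formula to work, here is a minimal sketch; every input is hypothetical, and the critical value \(t_c\) is hard-coded from a t-table:

```python
import math

# Hypothetical regression outputs.
SEE = 0.04      # standard error of estimate
n = 25          # number of observations
X_f = 0.05      # forecasted value of the independent variable
X_bar = 0.03    # sample mean of the independent variable
s2_X = 0.0004   # sample variance of the independent variable

# Standard error of the forecast.
s_f = SEE * math.sqrt(1 + 1 / n + (X_f - X_bar) ** 2 / ((n - 1) * s2_X))

# Prediction interval around a hypothetical forecast Y_f,
# using the two-tailed 5% critical value for df = 23.
Y_f = 0.0536
t_c = 2.069
lower, upper = Y_f - t_c * s_f, Y_f + t_c * s_f

print(round(s_f, 4))   # 0.0416
```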

To build a prediction interval, we need to follow 4 steps:

(...)

Let's now move on to an example illustrating the application of prediction intervals.

Let's make a prediction about the growth in ABC's net income for the next year, assuming a 5% increase in GDP over the same period. Assume the intercept is 0.0196 and the slope coefficient is 0.68. Please calculate the confidence interval for the predicted value of the dependent variable at a 5% significance level and then interpret the results. The standard error of the forecast is 4.02% and there are 25 observations.

(...)

Let's now have a look at the limitations of regression analysis. We should pay special attention to them as they can significantly distort a regression model. We will talk about:

- parameter instability,
- public knowledge of regression relations, and
- satisfaction of regression model assumptions.

Let’s begin with parameter instability. As we know, linear relations change over time. If we use a different time period, for instance data from a 12-month period instead of a 6-month period, the estimated relations may differ significantly. In other words, the parameters of a regression model can change significantly over time, which is a limitation of regression models.

Now let’s say a few words about public knowledge of regression relations. Applying regression models that proved accurate in the past may no longer be reasonable once those relations become widely known. If, for example, an analyst discovers a regression model that helps him earn abnormal profits on a group of companies, other analysts will probably follow suit quickly and the analyst's advantage will disappear.

And finally, let’s talk about satisfaction of regression model assumptions. If the assumptions underlying a regression model are not satisfied, for example if there is heteroskedasticity or the distribution of the error term is not normal, any interpretation of the model's results and hypothesis tests will be invalid. Obviously, it's possible to test for violations of the model's assumptions, but in practice such tests aren't always applied.

- Analysis of variance (ANOVA) divides the total variability of a variable into different sources: explained variation and unexplained variation.
- Sum of squares total (SST) measures the total variation in the dependent variable.
- Sum of squares regression (SSR) measures the variation in the dependent variable explained by the variation in the independent variable.
- Sum of Squares Error (SSE) measures the unexplained variation in the dependent variable, that is the variation that hasn't been explained by the variation in the independent variable.
- The F-statistic is used to test how a group of independent variables explains the variation of a dependent variable. It measures whether we can explain a significant part of the variance in the dependent variable using at least one independent variable.
- In a simple regression model with one independent variable, we can use the t-test instead of the F-test. However, the F-test is very useful in multiple regression models with several independent variables, where a single t-test cannot assess the joint significance of the slope coefficients.
- To build a prediction interval for a prediction we need to (a) make the prediction about the dependent variable using the forecasted value of the independent variable and the linear regression, (b) select a significance level, (c) compute the standard error of the forecast, and (d) determine the prediction interval.