[slide 01] Hi! The goal of this video is to introduce you to the LINEST spreadsheet function, which stands for 'linear estimation'. It allows us to conduct single and multiple regression analysis, and outputs model coefficients and associated statistics all in one fell swoop. Regression analysis is a powerful and versatile tool which is the go to technique for many problems. We will be relying on it heavily in the following weeks in a number of contexts, so it is important to get comfortable with it at this point. [slide 02] For this excursion, we are going to use a free data set of weekly investor sentiment from the American Association of Individual Investors. Every week, we get a reading on the fraction of investors that are bullish - that is, expect the market to go up, bearish - that is, expect the market to go down, and neutral. We also have the weekly price for the S&P 500 index. Let's see if we can predict the behavior of the market based on investor sentiment. Wouldn't that be a neat trick? [slide 03] The obvious first model is to examine the relationship between market return at time t with the levels of Bull and Bear sentiment the previous week. The next model tests that maybe the changes in the bull/bear sentiment rather than the levels are more important. We'll explore the second model in this video, and I'll leave you play around with other variations on your own later. [slide 04] The LINEST function is thoroughly documented on the Microsoft Office website, at the link you see in this slide, which is also listed in the regression resources page for this week. The function takes four arguments: the first is the dependent variable array, the second is one or more columns of independent variables, the third is a binary 0-1 argument that specifies whether you want to include an intercept term, and the last is another binary argument that specifies whether you want additional model statistics. LINEST is an array function, so after we put in the formula, we must hit Control-Shift-Enter to get the full output. Or check the array checkbox in the function editor. The output has the model coefficients in the top row, going from right to left, and the standard errors for the coefficients in the next row. It is important to pay attention to the order - it is counterintuitive, and basically the opposite of what you would expect! In the third row, we get the model R^2 and the standard error of the estimate, In the fourth row we get the F statistic and the degrees of freedom, and in the fifth row we get the regression sum of squares and the residual sum of squares. The output cells will not be labeled - we'll just get a table of data, so you'll just have to refer to the documentation for the layout to figure out what's where. [slide 05] Now let's take a look at the data and play! This data set is available to you on the course website in the module for this week. The first column is the date, with weekly intervals, the next three columns are the percentages of bullish, neutral, or bearish investors that week - which of course all add up to 1 plus or minus a small rounding error. The next few columns give some additional stats like the 8 week moving average and such, and the last column is the SP500 index value for that week. Let's just delete all the unnecessary extras to slim things down. Since our model calls for SP500 returns, and changes in Bull and Bear percentages, let's create those columns. [insert changebull, changebear, return, fill down] First, let's try a single variable regression, regressing the weekly S&P500 return on the change in the bearish sentiment percentage the previous week. We are going to use 40 most recent observations for this regression, just to keep things compact, but there's no rule to say exactly what our data window has to be - you can play with different lookback windows and see what things look like! Let's enter our formula. LINEST, we have the dependent variable, which is return on the SP500 weekly. Then we have our independent variable, which is the change in the bearish sentiment the previous week. So notice we start our selection one cell down, and select 40 rows of that data. and then we'll choose 1 and and 1 for the remaining arguments, because we do want that intercept and additional model statistics. Now let's hit Ctl-Shift-Enter and observe the result. Let's examine what this is telling us. The coefficients are here at the top, here's the intercept and here's the coefficient for the bearish sentiment, and these are the standard errors for the two coefficients. The rest are associated model statistics, there's the R^2, there's the F statistic, there's the degrees of freedom, and some residual and model squared errors. So how do we interpret all of these results? The first thing we want to know is whether the model is statistically significant. You can see the R^2 is not very good over here, but we won't really know until we calculate the t statistics and the p values. Recall that our t statistics are the ratio of the coefficient size divided by the standard error, if we are testing for difference from zero. So this is the t statistic for our coefficient on the change in bear sentiment, and this one is the t statistic for our intercept. Now are these significant? In order to figure that out we have to calculate the p values for these t statistics. So let's use the tstat function and give it the t statistic, the degrees of freedom, and specify that we want the 2 tailed test. And of course we are going to lock this in because I'm anticipating that we want to fill this to the right. Let's see, what's wrong with that? Aha, of course the function is called TDIST not TSTAT. There it is, that is the p value for our first t statistic, and that is ... we locked in the row instead of the column, which is why it shifted to the right, let's just fix it manually. So there is our t statistics. Our intercept just barely misses being significant at a 5% level, and the coefficient estimate on the change in bearish sentiment is definitely not significant, with a p value of .38, which means there's a good 38% chance that whatever coefficient we observe could be not actually different from zero. Let's label these here, tstat and here we have pvalues. Now let's see if the overall model is significant. This right here if you refer to the documentation is the F statistic for the overall model, that we can test for significance. Of course since both of the coefficients are not actually significant, we anticipate that the overall model is also not significant, but let's do it anyway. So we have FDIST, we give it the f statistic, we give it the number of degrees of freedom in the model, which is 2 because we have the intercept and also the change in bearish sentiment, and also the second degrees of freedom, which is based on the number of observations minus the number of coefficients, which is here, and let's see what our result looks like. Unsurprisingly, the F statistic for the entire model is also not significant, 46% chance that all of it is just equal to zero. Let's give it a label, F pvalue. So maybe the change in the bearish sentiment is not particularly a good predictor of next week's SP500 returns. Why don't we try the change in the bullish sentiment? In order to edit the model, we have to select the whole range of output for this array function, otherwise we are going to get an error if we try to edit just one cell. As you can see here we cannot change only part of an array. so if we select the whole thing, now we can start editing this function. Let's just take our reference to the column of changes in bearish sentiment, and drag it over one column over to the left to get the bullish sentiment, and again we have to hit ctl-shift-enter, because this is an array function, and there is our new model. Everything is already calculated for us from the previous model, we have the t statistics, we have the p values, and we have the p value for the overall model, and again, as you can see from the p values nothing is even really close to significance here. Now let's try a multiple regression. We can use both the change in the bullish and bearish sentiment in the same model, at the same time, to see if they jointly have a correlation with next week's SP500 market returns. To run a multiple regression we don't really need to do anything different, except for selecting multiple columns of data for our independent variables. Let's see how this works. So we have LINEST, our dependent variable is still the weekly close of SP500 index (returns), select 40 observations, and our independent variables are going to be both the bull and bear change the previous week. Select 40 observations, and again we will select 1 and 1 for the remaining two arguments, and hit ctl-shift-enter. And there is our result. Notice that in the spots where we are not expecting any output, we get NA, which is fine, don't worry about it. And again in the top row we get the coefficients, in the second row we get the standard errors, and the rest of the output is the same as before. Let us again calculate the statistics and the p values to see if anything is significant. tstats over here, so the t statistic is again the coefficient value divided by the standard error, and let's fill them to the right to include all the coefficients. The first one here on the right is the intercept, the second one is the bull change, the change in bullish sentiment, because remember the bull change was the first of the two coefficients, and the last one is the bear change, here in the leftmost column. Let's calculate the p values, that'll be TDIST, give it the statistic, give it the degrees of freedom that we want to lock in, and give it 2 for a two tailed test. And there is our p value, let's go to the right... So it appears that none of the coefficients here are significant at a 5% level. Let's also confirm what things are looking like at the level of the model, by calculating the p value for the model F statistic. FDIST, give it the f statistic, give it 3 now, because we have 3 degrees of freedom with the intercept and the two independent variables, and the model degrees of freedom based on the number of observations. Let's see what we get, 0.43, again not a surprise that this is also not significant at the level of the model. Now let's try another thing. we might be thinking, perhaps the changes in bull and bear sentiment might take a while to propagate to actual market activity. So maybe instead of looking at the changes in bull and bear sentiment the previous week, we are going to look at the changes in bull and bear sentiment two weeks before. And let's see if that changes the predictive ability of our model. Let's hit ctl-shift-enter, and observe. So first of all, you'll notice our p values have acquired some error output, that's because the TDIST function requires a positive input for the t statistic, since it is symmetric it basically assumes you always use the positive tail. So let's give it the ABS function to make sure we take the absolute value of the statistic before feeding it in, and let's fill it to the side. And let's observe, you can see that the p value for the change in the bearish sentiment two weeks ago, now we are looking at changes two weeks before the return, is significant at about the 1% level, the change in bull sentiment is still not significant, and the intercept is now also significant at the 5% level. And the overall model, not surprisingly, is significant at about the 1% level. Let's discuss the interpretation of these coefficients. This coefficient on the change in the bear sentiment means that when the bearish sentiment goes up by one percentage point, the return in the market the next week (two weeks actually) goes down by 0.14 percentage points. This is not an insignificant result. Have we discovered something new and exciting here that will generate wonderful returns by using this strategy? Before throwing real money into it, I'll leave it up to you to do some additional testing. Perhaps you can try different time windows, or you can see if this pattern persists as you try it in different years. Try the same thing one year ago, two years ago, three years ago. There is really a lot more testing that needs to be done before you start throwing money at your trading strategy. The point of this exercise is not to say, here's a wonderful strategy, but to demonstrate to you how to use the tool of regression analysis in order to answer questions about the relationships between various variables. [slide 06] I hope you enjoyed this introduction to regression and you see how powerful this technique can be for finding patterns in data. Perhaps this will whet your appetite for learning more about it and applying it to all kinds of problems you come across in the future. See you on the forums!