Fourth Blog Post!
Prompt:
When building regression models, it often takes a lot of experience and knowledge about your data in order to determine the variables and transformation of variables that you want to include in the model building process. There are many variable selection techniques (or feature selection) but it can be a confusing practice when you are first learning. Write up a brief discussion of how you would plan to determine variables to use in a regression model. What variable selection techniques do you prefer and why?
Response:
I don’t have much experience with regression modeling, but I’ll give my opinion based on what I learned from the assigned reading. There are a few principles to keep in mind with regression. (1) It is important that we explain the data in the simplest way, with redundant predictors removed. (2) Unnecessary predictors add noise and waste degrees of freedom. (3) Collinearity is caused by having too many variables trying to do the same job. Before variable selection, we need to identify any outliers and influential points, and apply any transformations of the variables that seem appropriate.
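Here is a minimal sketch of that pre-selection check, using statsmodels on a made-up dataset (the columns x1, x2, and y are hypothetical, not from any real data):

```python
# A minimal sketch of the pre-selection checks mentioned above, using
# statsmodels on a hypothetical dataset.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
})
df["y"] = 2 * df["x1"] - df["x2"] + rng.normal(size=100)

X = sm.add_constant(df[["x1", "x2"]])  # add an intercept column
model = sm.OLS(df["y"], X).fit()

# Cook's distance flags influential points; a common rule of thumb
# is to inspect observations with distance greater than 4/n.
cooks_d = model.get_influence().cooks_distance[0]
influential = np.where(cooks_d > 4 / len(df))[0]
print("Potentially influential rows:", influential)
```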
There seem to be two extremes with variable selection: a model with only the single best predictor, and a model with every predictor. The challenge is to find the necessary variables somewhere in between, deleting both irrelevant variables and redundant variables. We will discuss five different methods for variable selection. The first two are (1) forward selection (FS) and (2) backward elimination (BE). These methods add or delete variables one at a time until no further change improves the model. They aren’t necessarily the best methods, as you may end up with irrelevant or redundant variables in your model. The next best thing is (3) stepwise selection (SW), which is a modification of forward selection. It differs because variables already in the model do not necessarily stay. This seems better than FS because variables can be removed from the model if deemed unnecessary.
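To make the first two methods concrete, here is a hedged sketch of FS and BE using scikit-learn’s SequentialFeatureSelector (true stepwise selection, where a variable added earlier can later be dropped, is not built into scikit-learn, so SW would need custom code or another package):

```python
# Forward selection and backward elimination with scikit-learn,
# demonstrated on the built-in diabetes dataset.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)
lr = LinearRegression()

# Forward selection: start empty, add the variable that helps most each step.
fs = SequentialFeatureSelector(lr, n_features_to_select=4, direction="forward")
fs.fit(X, y)
print("FS picks:", list(X.columns[fs.get_support()]))

# Backward elimination: start full, drop the least useful variable each step.
be = SequentialFeatureSelector(lr, n_features_to_select=4, direction="backward")
be.fit(X, y)
print("BE picks:", list(X.columns[be.get_support()]))
```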
Another method is (4) R-squared. This method finds several subsets of different sizes that best predict the dependent variable based on the appropriate test statistic. A potential issue with this type of selection is that no single subset is likely to stand out as clearly the best. On the other hand, this lets you choose among the candidate subsets using non-statistical considerations. The fifth method is (5) all-possible subsets. This method builds all one-variable models, all two-variable models, and so on, until the full all-variable model is generated. The downside is the computational cost: with p predictors there are 2^p - 1 possible models, so the work roughly doubles with every variable you add.
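As a rough illustration, here is a minimal all-possible-subsets sketch that fits every subset and ranks them by adjusted R-squared (it reuses the hypothetical df, y, x1, and x2 from the first sketch):

```python
# All-possible-subsets selection: enumerate every non-empty subset of
# predictors, fit an OLS model for each, and rank by adjusted R-squared.
from itertools import combinations
import statsmodels.api as sm

def best_subsets(df, response, predictors):
    results = []
    for k in range(1, len(predictors) + 1):
        for subset in combinations(predictors, k):
            X = sm.add_constant(df[list(subset)])
            fit = sm.OLS(df[response], X).fit()
            results.append((fit.rsquared_adj, subset))
    # Best (highest adjusted R-squared) first.
    return sorted(results, reverse=True)

# Usage, with the hypothetical dataframe from the first sketch:
for adj_r2, subset in best_subsets(df, "y", ["x1", "x2"])[:5]:
    print(f"{adj_r2:.3f}  {subset}")
```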
Looking over all of the methods above, I think the one I prefer is either (3) stepwise selection or (5) all-possible subsets. Seeing as how my computer probably isn’t powerful enough to handle the latter, I will go with stepwise selection. I like this one because it removes variables that are unnecessary, which is something that FS and BE do not do. The (4) R-squared method could be useful if I have non-statistical conditions that I have to take into account!