
Machine learning uses so-called features (i.e. variables or attributes) to generate predictive models. Using a suitable combination of features is essential for obtaining high precision and accuracy. Because too many (unspecific) features pose the problem of overfitting the model, we generally want to restrict the features in our models to those that are most relevant for the response variable we want to predict. Using as few features as possible will also reduce the complexity of our models, which means they need less time and computing power to run and are easier to understand. There are several ways to identify how much each feature contributes to the model and to restrict the number of selected features. Here, I am going to examine the effect of feature selection via Correlation, Recursive Feature Elimination (RFE) and Genetic Algorithm (GA) on Random Forest models.

Additionally, I want to know how different data properties affect the influence of these feature selection methods on the outcome. For that I am using three breast cancer datasets: one of them has few features; the other two are larger but differ in how well the outcome clusters in PCA.
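To make the correlation and RFE approaches concrete, here is a minimal sketch in Python with scikit-learn (an illustration only, not the code used for the analysis in this post; the 0.7 correlation cutoff and the 10 features kept by RFE are arbitrary assumptions, and a GA search over feature subsets would need a dedicated package and is not shown):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# scikit-learn ships a copy of the Breast Cancer Wisconsin (Diagnostic) data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Correlation method: for every pair of features with an absolute
# correlation above 0.7, drop one of the two.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
X_uncorrelated = X.drop(columns=to_drop)

# Recursive Feature Elimination: repeatedly fit a Random Forest and
# discard the least important features until 10 remain.
rf = RandomForestClassifier(n_estimators=500, random_state=42)
rfe = RFE(estimator=rf, n_features_to_select=10, step=1)
rfe.fit(X, y)
rfe_features = X.columns[rfe.support_]

print("Correlation method keeps:", list(X_uncorrelated.columns))
print("RFE keeps:", list(rfe_features))
```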
Based on my comparisons of the correlation method, RFE and GA, I would conclude that for Random Forest models

- removing highly correlated features isn't a generally suitable method,
- GA produced the best models in this example but is impractical for everyday use-cases with many features because it takes a lot of time to run with sufficient generations and individuals, and
- data that doesn't allow a good classification to begin with (because the features are not very distinct between classes) doesn't necessarily benefit from feature selection.

My conclusions are of course not to be generalized to any ol' data you are working with: there are many more feature selection methods, and I am only looking at a limited number of datasets and only at their influence on Random Forest models. But even this small example shows how different features and parameters can influence your predictions. With machine learning, there is no "one size fits all"! It is always worthwhile to take a good hard look at your data and get acquainted with its quirks and properties before you even think about models and algorithms. And once you've got a feel for your data, investing the time and effort to compare different feature selection methods (or engineered features), model parameters and - finally - different machine learning algorithms can make a big difference!
Breast Cancer Wisconsin (Diagnostic) Dataset

The data I am going to use to explore feature selection methods is the Breast Cancer Wisconsin (Diagnostic) Dataset, which is described in:

- Nuclear feature extraction for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San Jose, CA, 1993.
- Breast cancer diagnosis and prognosis via linear programming.
- Machine learning techniques to diagnose breast cancer from fine-needle aspirates.
- Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol.
- Computerized breast cancer diagnosis and prognosis from fine needle aspirates.
- Computer-derived nuclear features distinguish malignant from benign breast cytology.

The data was downloaded from the UC Irvine Machine Learning Repository.
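As a rough illustration of what loading that file could look like, here is a sketch in Python with pandas; the download URL and the constructed column names are assumptions based on the repository's documentation (the UCI site layout may change, and the analysis in this post may read the data differently):

```python
import pandas as pd

# wdbc.data has no header row: an ID column, the diagnosis (M = malignant,
# B = benign) and 30 numeric nuclear features (mean, standard error and
# "worst" value of ten cell-nucleus measurements).
base_features = ["radius", "texture", "perimeter", "area", "smoothness",
                 "compactness", "concavity", "concave_points", "symmetry",
                 "fractal_dimension"]
columns = (["id", "diagnosis"]
           + [f"{f}_mean" for f in base_features]
           + [f"{f}_se" for f in base_features]
           + [f"{f}_worst" for f in base_features])

# Assumed location of the raw file in the UCI Machine Learning Repository
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "breast-cancer-wisconsin/wdbc.data")

bc_data = pd.read_csv(url, header=None, names=columns)
bc_data["diagnosis"] = bc_data["diagnosis"].astype("category")

print(bc_data.shape)                        # expected: (569, 32)
print(bc_data["diagnosis"].value_counts())  # expected: 357 benign, 212 malignant
```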
