Data mining
Classification trees – used to predict categorical responses. The algorithm splits the dataset at each branch to optimise some criterion (e.g. minimise entropy, which is the same as maximising information gain). Regression trees do the same for continuous responses (each leaf gives the expected value).
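To make the splitting criterion concrete, here is a minimal sketch of choosing one split by information gain on a single numeric predictor. The function names (entropy, information_gain, best_threshold) are illustrative assumptions, not taken from any particular package, and real tree software adds stopping rules, pruning and handling of several predictors.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

def best_threshold(xs, ys):
    """Try each midpoint between sorted distinct x values.

    Returns (threshold, gain) for the split with the largest gain,
    or (None, 0.0) if no split improves on the parent node.
    """
    values = sorted(set(xs))
    best = (None, 0.0)
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        gain = information_gain(ys, left, right)
        if gain > best[1]:
            best = (t, gain)
    return best
```

A full tree would apply best_threshold recursively to the left and right subsets until some stopping rule is met.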
Problems:
- can be unnecessarily complicated
- structure often unstable: small changes in the training data can give a very different tree (a symptom of over-fitting)
Validation and standard errors:
- testing on the training set gives very optimistic error estimates
- hold-out validation: keep a separate test dataset (wastes data)
- k-fold cross-validation (divide into k parts, hold one part out for testing, train on the rest, repeat for all k parts)
- bootstrapping, jack-knifing etc.
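The k-fold procedure can be sketched in a few lines. This is only an illustration under stated assumptions: majority_class is a hypothetical stand-in for a real model (it always predicts the most common training label), the helper names are made up for this example, and the number of observations is assumed to be at least k so that no fold is empty.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_class(train_labels):
    """Placeholder 'model': predicts the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]

def cross_val_error(labels, k=5, seed=0):
    """Mean misclassification rate over the k held-out folds."""
    folds = k_fold_indices(len(labels), k, seed)
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        # Train on everything that is not in the held-out fold.
        train = [labels[i] for i in range(len(labels)) if i not in held_out]
        prediction = majority_class(train)
        # Test only on the held-out fold.
        wrong = sum(labels[i] != prediction for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k
```

Every observation is used for testing exactly once and for training k - 1 times, which is why k-fold wastes less data than a single hold-out set.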
Many other (more complicated) methods exist, with varying improvements in prediction. There is still no consensus on which method is best in which situation.
Ethics
Don’t forget about ethics! Combining information via data warehousing could violate the Privacy Act. Data mining raises ethical issues mainly when it is applied: should we use ethnicity if it is a good predictor? The answer depends on the application.