Data preprocessing

Why preprocessing?

  1. Real-world data are generally
  2. Tasks in data preprocessing

Data cleaning

  1. Fill in missing values (attribute or class value):
  2. Identify outliers and smooth out noisy data:
  3. Correct inconsistent data: use domain knowledge or expert decision.
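
The first two cleaning tasks can be sketched in a few lines. The outline does not say which methods are intended, so the choices below are assumptions: missing values are filled with the mean (or median) of the observed values, and outliers are flagged with the common interquartile-range rule. The names fill_missing and iqr_outliers are illustrative only.

```python
from statistics import mean, median, quantiles

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean (or median) of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, Python 3.8+
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

cleaned = fill_missing([1.0, None, 3.0])
suspects = iqr_outliers([10, 12, 11, 13, 12, 95])
```

Whether a flagged value is an error or a genuine extreme is a domain decision, so a sketch like this only proposes candidates for review.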

Data transformation

  1. Normalization:
  2. Aggregation: moving up in the concept hierarchy on numeric attributes.
  3. Generalization: moving up in the concept hierarchy on nominal attributes.
  4. Attribute construction: replacing existing attributes with, or adding, new attributes inferred from existing ones.
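
As a rough illustration of the transformations above: the outline does not name a normalization method, so min-max scaling and z-score standardization are shown as two common assumptions. Generalization is sketched as a lookup in a concept hierarchy; the hierarchy dictionary and all function names here are hypothetical.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: no spread to rescale
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def generalize(value, hierarchy, levels=1):
    """Move a nominal value up `levels` steps in a child -> parent hierarchy."""
    for _ in range(levels):
        value = hierarchy.get(value, value)  # stay put at the top level
    return value

# Hypothetical concept hierarchy for a nominal "location" attribute.
location = {"Boston": "Massachusetts", "Massachusetts": "USA"}

scaled = min_max([10, 20, 30])
standardized = z_score([2, 4, 4, 4, 5, 5, 7, 9])
state = generalize("Boston", location)
```

Min-max scaling is sensitive to extreme values because they define the range, which is one reason to identify outliers before normalizing.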

Data reduction

  1. Reducing the number of attributes
  2. Reducing the number of attribute values
  3. Reducing the number of tuples
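
Two of these reductions can be sketched under simple assumptions, since the outline does not name specific techniques. Here the number of attributes is reduced by dropping attributes that take a single value in every tuple, and the number of tuples is reduced by simple random sampling without replacement. Rows are assumed to be dictionaries with identical keys; the function names are illustrative.

```python
import random

def drop_constant_attributes(rows):
    """Remove attributes whose value is identical in every tuple."""
    keep = [a for a in rows[0] if len({r[a] for r in rows}) > 1]
    return [{a: r[a] for a in keep} for r in rows]

def sample_tuples(rows, k, seed=0):
    """Draw k tuples uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(rows, k)

rows = [{"id": i, "unit": "kg", "weight": i * 2} for i in range(10)]
fewer_attributes = drop_constant_attributes(rows)
fewer_tuples = sample_tuples(rows, 3)
```

Reducing the number of attribute values is the subject of the discretization methods listed next.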

Discretization and generating concept hierarchies

  1. Unsupervised discretization: the class variable is not used.
  2. Supervised discretization: uses the values of the class variable.
  3. Generating concept hierarchies: recursively applying partitioning or discretization methods.
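
The difference between the two kinds of discretization can be seen in a minimal sketch. Equal-width binning stands in for the unsupervised case and a single entropy-minimizing cut point stands in for the supervised case; both are assumed examples rather than the specific methods the outline has in mind, and the function names are illustrative.

```python
from collections import Counter
from math import log2

def equal_width_bins(values, n_bins):
    """Unsupervised: map each value to one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def best_entropy_split(values, labels):
    """Supervised: cut point minimizing the weighted class entropy of both sides."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut

bins = equal_width_bins([0, 5, 10, 15, 20], 2)
cut = best_entropy_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
```

Applying best_entropy_split again to the values on each side of the cut yields nested intervals, which is one way to read the recursive construction of a concept hierarchy.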