
By KADE Lab LUMS
Data Preprocessing Goals
- Data preprocessing has two primary goals: improving data quality and shaping the data into the format required by the modeling task.
- Data shaping covers the selection of data, attribute sets, and object sets, as well as the physical data format (e.g., tabular, CSV, graph).
- Data quality is assessed both quantitatively (e.g., counting missing values/nulls) and qualitatively.
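The quantitative side of this assessment can be sketched in a few lines of pandas; the toy dataset below is purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset for illustration.
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31, np.nan],
    "salary": [50_000, 62_000, np.nan, 58_000, 61_000],
    "city":   ["Lahore", "Karachi", None, "Lahore", "Islamabad"],
})

# Quantitative quality check: count and fraction of missing values per attribute.
missing_counts = df.isna().sum()
missing_fraction = df.isna().mean()
```

The qualitative characteristics in the next section (believability, timeliness, and so on) cannot be computed this way and still need human judgment.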
Assessing Data Quality (Qualitative)
- Key qualitative characteristics include accuracy (closeness to the true value), completeness, consistency (especially when integrating multiple sources), timeliness, believability, and interpretability.
- Outdated data is considered low quality; for instance, sales trend analysis may only need the last 1-2 years of data.
- Believability requires the data scientist to judge whether observed values "make sense" in the business context.
Data Cleaning
- Data cleaning addresses three sources of poor quality: missing values, noise/outliers, and inconsistencies.
- When handling missing values, attributes with over 70% missing data may be dropped entirely; otherwise, simple techniques such as filling with the attribute mean (or a category-specific mean) are preferred, since more complex imputation models can introduce their own bias.
- Noise is random variation in the data; it can be smoothed, often with bin-based smoothing, which replaces every value in a bin with that bin's mean.
- Identifying inconsistencies often requires semi-automatic or human-in-the-loop checks, especially when integrating data from different sources with varying units or spellings (e.g., address normalization).
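The missing-value and smoothing steps above can be sketched as follows; the 70% threshold, the three-bin split, and the toy data are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one attribute is 80% missing, one has scattered gaps.
df = pd.DataFrame({
    "mostly_missing": [np.nan] * 8 + [1.0, 2.0],
    "score": [3.0, np.nan, 5.0, 4.0, np.nan, 6.0, 5.0, 4.0, np.nan, 6.0],
})

# Drop attributes above the (assumed) 70% missing-value threshold.
df = df.loc[:, df.isna().mean() <= 0.70]

# Fill the remaining gaps with the attribute mean.
df["score"] = df["score"].fillna(df["score"].mean())

# Bin-based smoothing: sort, split into 3 equal-frequency bins,
# and replace every value in a bin with that bin's mean.
values = np.sort(df["score"].to_numpy())
bins = np.array_split(values, 3)
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
```

Note the trade-off the summary mentions: mean-filling is simple but pulls the distribution toward its center, so the introduced bias should be kept in mind downstream.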
Overall Data Preprocessing Effort & Integration
- Data preprocessing is a "dirty" and time-consuming job, often taking more than 70% of the total time in a data analysis project.
- Data integration combines data from multiple sources, typically following the ETL (Extract, Transform, Load) workflow for data warehouses, or ELT for data lakes.
- Integration issues include schema inconsistency (different attribute names for the same concept) and value inconsistency (same attribute name, different meanings or units); correlation studies can help identify duplicate attributes.
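A minimal sketch of such a correlation study, using a hypothetical pair of temperature columns recorded in different units (the 0.95 cutoff is an assumed choice, not a fixed rule):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
temp_c = rng.normal(25, 5, 100)
df = pd.DataFrame({
    "temp_c": temp_c,
    "temp_f": temp_c * 9 / 5 + 32,   # same measurement, different unit
    "humidity": rng.uniform(30, 90, 100),
})

# Pairwise correlation matrix; perfectly redundant attributes correlate at ±1.
corr = df.corr()

# Flag attribute pairs whose |correlation| exceeds an assumed 0.95 cutoff.
redundant = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > 0.95
]
```

Flagged pairs like `("temp_c", "temp_f")` are candidates for dropping one attribute, though a human check is still advisable before deleting anything.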
Data Transformation and Reduction
- Data transformation changes the structure or values of the data; normalization and discretization are the main techniques.
- Normalization scales numeric attributes to a common range (e.g., Min-Max normalization to [0, 1], or Z-score normalization yielding zero mean and unit standard deviation) so that all attributes contribute reasonably to the analysis.
- Discretization converts numeric data into categorical labels, which algorithms such as Apriori (frequent pattern mining) require; bin-based discretization uses either equidistant (equal-width) or equal-frequency bins.
- Data reduction decreases data size without significantly hurting analysis performance; it divides into numerosity reduction (fewer rows/objects) and dimensionality reduction (fewer columns/features).
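The transformation techniques above can be sketched with pandas; the age values, bin counts, and bin labels are hypothetical:

```python
import pandas as pd

ages = pd.Series([22.0, 35.0, 47.0, 51.0, 28.0, 63.0])

# Min-Max normalization: rescale to [0, 1].
minmax = (ages - ages.min()) / (ages.max() - ages.min())

# Z-score normalization: zero mean, unit standard deviation.
zscore = (ages - ages.mean()) / ages.std(ddof=0)

# Equidistant (equal-width) discretization into 3 labeled bins.
equidistant = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency discretization: each bin holds roughly the same number of objects.
equifreq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])
```

The categorical labels produced by `pd.cut`/`pd.qcut` are the kind of input a frequent-pattern algorithm like Apriori expects.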
Numerosity and Dimensionality Reduction
- Numerosity reduction commonly uses sampling; sampling with replacement is generally preferred because it better preserves the true data distribution and so minimizes bias, though the difference matters little for extremely large datasets.
- Dimensionality reduction splits into feature subset selection (greedily choosing the best existing attributes against a criterion such as classification performance) and finding new dimensions.
- Techniques for finding new dimensions include Principal Component Analysis (PCA), which maximizes variance in the new space, Independent Component Analysis (ICA), and Non-negative Matrix Factorization (NMF).
- Quantization is a form of numerosity reduction/compression that stores each value in fewer bits, improving storage and computational efficiency.
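A compact numpy sketch of these reduction techniques, implementing PCA directly via SVD rather than a library routine; the sample size, component count, and 8-bit scale are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))          # hypothetical data: 500 objects, 5 attributes

# Numerosity reduction: sample 100 rows with replacement.
idx = rng.integers(0, len(X), size=100)
sample = X[idx]

# Dimensionality reduction: PCA via SVD, projecting onto the 2 directions
# of maximum variance (the top-2 right singular vectors).
Xc = X - X.mean(axis=0)                # center the data first
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:2].T

# Quantization: store each value in 8 bits instead of 64 via uniform scaling.
scale = np.abs(X).max() / 127
X_q = np.round(X / scale).astype(np.int8)
```

By construction the first principal component carries at least as much variance as the second, matching the "maximize variance in the new space" description above.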
Key Points & Insights
- Data preprocessing focuses on improving quality and shaping data for modeling, often consuming over 70% of project time.
- Data quality assessment checks accuracy, completeness, consistency, timeliness, believability, and interpretability.
- When cleaning missing data, use simple filling techniques while staying mindful of the bias they introduce; dropping attributes with over 70% missing values is a common heuristic.
- Normalization (Min-Max or Z-score) is crucial when numeric attributes have vastly different scales (e.g., age vs. salary) so that all features contribute fairly to the analysis.
- Sampling with replacement is generally preferred for numerosity reduction because it better maintains the original data distribution than sampling without replacement.
Video summarized with SummaryTube.com on Feb 02, 2026, 05:57 UTC
Full video URL: youtube.com/watch?v=RmmDAO12kbY
Duration: 1:18:23
