Data Collection and Preprocessing

Data Types and Sources

1. Data Types:

  • Structured Data: Organized in a clear, easily searchable format, typically in tables with rows and columns (e.g., databases, spreadsheets).
  • Unstructured Data: Lacks a predefined structure, often text-heavy, such as emails, social media posts, images, or videos.
  • Semi-Structured Data: Contains elements of both structured and unstructured data, like JSON, XML, or log files.
  • Time-Series Data: Data points collected or recorded at specific time intervals, used in financial markets, sensor readings, etc.
  • Geospatial Data: Information about physical objects on Earth, often used in maps and GPS systems.

2. Data Sources:

  • Databases: Relational (e.g., MySQL, PostgreSQL) and non-relational (e.g., MongoDB) databases.
  • APIs: Interfaces provided by services to access their data programmatically (e.g., Twitter API, Google Maps API).
  • Web Scraping: Extracting data from websites using tools like BeautifulSoup or Scrapy (see the sketch after this list).
  • Sensors: IoT devices, wearables, and other hardware that collect real-time data.
  • Public Datasets: Open data repositories like Kaggle, UCI Machine Learning Repository, or government databases.
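
As a minimal illustration of programmatic collection, the sketch below fetches a page with requests and parses it with BeautifulSoup; the URL is a placeholder, and any real scraper should respect the target site's robots.txt and terms of service.

import requests
from bs4 import BeautifulSoup

# Placeholder URL -- replace with a page you are permitted to scrape
url = "https://example.com"

# Fetch the page and fail loudly on a bad HTTP status
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML, then pull out the page title and all link targets
soup = BeautifulSoup(response.text, "html.parser")
print("Title:", soup.title.string if soup.title else None)
links = [a.get("href") for a in soup.find_all("a")]
print("Links found:", len(links))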

Tensors:

  • Definition: A tensor is a generalization of vectors and matrices to higher dimensions; a vector is a first-order tensor and a matrix a second-order tensor. Tensors are used in deep learning, physics, and other higher-dimensional data representations.
  • Notation: Tensors are often denoted by uppercase letters (e.g., T) with subscript indices for each dimension, such as T_{ijk} for a third-order tensor.
  • Operations: Tensor operations generalize matrix operations to higher dimensions, including addition, multiplication, and contraction (summing over a shared index).
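
As a quick sketch, NumPy's n-dimensional arrays can represent tensors; the example below builds a third-order tensor T_{ijk} and contracts it with a matrix along the shared index using np.tensordot (the shapes are arbitrary).

import numpy as np

# A third-order tensor T_{ijk} with shape (2, 3, 4)
T = np.arange(24).reshape(2, 3, 4)
print("Tensor shape:", T.shape)

# Addition generalizes element-wise matrix addition to higher dimensions
print("Addition preserves shape:", (T + T).shape)

# Contraction: sum over the shared index k between T_{ijk} and M_{kl},
# yielding a tensor of shape (2, 3, 5)
M = np.ones((4, 5))
contracted = np.tensordot(T, M, axes=([2], [0]))
print("Contracted shape:", contracted.shape)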

Data Cleaning: Handling Missing Values, Outliers

1. Handling Missing Values:

  • Removal:
    • Delete Rows: Remove rows with missing values if they constitute a small portion of the data.
    • Delete Columns: Remove columns with a significant proportion of missing values.
  • Imputation:
    • Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the column.
    • Forward/Backward Fill: Fill missing values with the previous/next observation in time-series data.
    • Interpolation: Estimate missing values based on surrounding data points, particularly in time-series data.
  • Advanced Techniques:
    • K-Nearest Neighbors (KNN) Imputation: Estimate missing values based on similar rows.
    • Multiple Imputation: Generate multiple imputations and average them to handle uncertainty.
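
A minimal sketch of these strategies on toy data, assuming pandas and scikit-learn are available; the values are invented for illustration.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy time-series-like column with gaps
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# Mean imputation
print(s.fillna(s.mean()).tolist())

# Forward fill / backward fill (common for time series)
print(s.ffill().tolist())
print(s.bfill().tolist())

# Linear interpolation between surrounding points
print(s.interpolate().tolist())

# KNN imputation: estimate a missing value from the most similar rows
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})
print(KNNImputer(n_neighbors=2).fit_transform(df))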

2. Handling Outliers:

  • Identification:
    • Z-Score: Compute Z = (x - mean) / standard deviation; data points with |Z| above a chosen threshold (e.g., |Z| > 3) are flagged as outliers.
    • IQR Method: Points lying below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR (where IQR = Q3 - Q1 is the interquartile range) are considered outliers.
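
A minimal sketch of both identification rules on synthetic data; the planted value 95 should be flagged by each rule (an occasional extreme random draw may also appear).

import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(50, 5, size=200), [95.0]])  # 95 is a planted outlier

# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 3])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print("IQR outliers:", data[(data < lower) | (data > upper)])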

Feature Engineering: Scaling, Encoding, Selection

Scaling:

  • Standardization: Rescale data to have a mean of 0 and a standard deviation of 1.
  • Min-Max Scaling: Scale data to a fixed range, typically [0, 1].
  • Robust Scaling: Use median and IQR for scaling, which is robust to outliers.
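
A minimal sketch comparing the three scalers on the same column, assuming scikit-learn; note how the outlier 100 dominates min-max scaling but barely affects robust scaling.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

# Standardization: (x - mean) / std
print("Standard:", StandardScaler().fit_transform(X).ravel())

# Min-max: (x - min) / (max - min); the outlier squashes the rest near 0
print("Min-max: ", MinMaxScaler().fit_transform(X).ravel())

# Robust: (x - median) / IQR, largely insensitive to the outlier
print("Robust:  ", RobustScaler().fit_transform(X).ravel())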

Encoding:

  • One-Hot Encoding: Convert categorical variables into a series of binary columns.
  • Label Encoding: Assign a unique integer to each category.
  • Ordinal Encoding: Encode categorical variables where order matters (e.g., “low”, “medium”, “high”).
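
The sketch below applies all three encodings to a toy column, assuming pandas and scikit-learn; for the ordinal case the category order is supplied explicitly.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({"size": ["low", "high", "medium", "low"]})

# One-hot encoding: one binary column per category
print(pd.get_dummies(df["size"], prefix="size"))

# Label encoding: an arbitrary integer per category (alphabetical here)
print(LabelEncoder().fit_transform(df["size"]))

# Ordinal encoding: integers that respect the stated order low < medium < high
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(ordinal.fit_transform(df[["size"]]).ravel())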

Feature Selection:

  • Filter Methods: Select features based on statistical tests like Chi-square or correlation.
  • Wrapper Methods: Use algorithms like Recursive Feature Elimination (RFE) to select features.
  • Embedded Methods: Feature selection occurs during the training of the model, e.g., Lasso regression.
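
A minimal sketch with one method from each family, using scikit-learn's built-in iris dataset; keeping two features is an arbitrary choice for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import Lasso, LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter: keep the 2 features with the highest chi-square scores
X_filtered = SelectKBest(chi2, k=2).fit_transform(X, y)
print("Filter kept shape:", X_filtered.shape)

# Wrapper: recursively eliminate features based on a model's coefficients
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("Wrapper kept mask:", rfe.support_)

# Embedded: Lasso's L1 penalty can drive some coefficients to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso coefficients:", lasso.coef_)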

Data Splitting: Training, Validation, and Test Sets

Training Set:

  • Purpose: The portion of data used to train the model. The model learns patterns and relationships from this dataset.
  • Typical Split: 60-80% of the entire dataset.

Validation Set:

  • Purpose: Used to tune model parameters and prevent overfitting by evaluating the model’s performance on unseen data during the training process.
  • Typical Split: 10-20% of the entire dataset.

Test Set:

  • Purpose: Used to evaluate the final model’s performance and generalization ability on completely unseen data.
  • Typical Split: 10-20% of the entire dataset.

Example: Data Cleaning and Splitting in Python

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Example data
data = {
    'Age': [25, 30, 45, None, 35, 50, 28, None],
    'Salary': [50000, 54000, 61000, 58000, None, 69000, 72000, 65000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes']
}

# Create DataFrame
df = pd.DataFrame(data)

# 1. Handle missing values (mean imputation on both numeric columns at once)
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])

# 2. Encode categorical variables
df['Purchased'] = df['Purchased'].map({'No': 0, 'Yes': 1})

# 3. Scale features (in a real pipeline, fit the scaler on the training split only to avoid data leakage)
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

# 4. Split data into training, validation, and test sets
X = df[['Age', 'Salary']]
y = df['Purchased']

# Split into train+val and test first
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Then split train+val into train and validation (0.25 of the remaining 80% = 20% of the total, a 60/20/20 split)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)

print("Training set size:", len(X_train))
print("Validation set size:", len(X_val))
print("Test set size:", len(X_test))

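Example: Vector and Matrix Operations with NumPy

The snippet below illustrates the basic vector and matrix operations that tensors generalize (see the Tensors subsection above).
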
import numpy as np

# 1. Create a vector
vector = np.array([1, 2, 3])
print("Vector:", vector)

# 2. Create a matrix
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Matrix:\n", matrix)

# 3. Perform vector addition
vector2 = np.array([4, 5, 6])
vector_sum = vector + vector2
print("Vector Addition:", vector_sum)

# 4. Perform scalar multiplication
scalar = 3
scalar_mult = scalar * vector
print("Scalar Multiplication:", scalar_mult)

# 5. Perform matrix multiplication
matrix2 = np.array([[1, 2, 1], [2, 1, 2], [1, 2, 1]])
matrix_mult = np.dot(matrix, matrix2)
print("Matrix Multiplication:\n", matrix_mult)

# 6. Compute dot product of two vectors
dot_product = np.dot(vector, vector2)
print("Dot Product of vectors:", dot_product)

# 7. Find the transpose of a matrix
transpose = np.transpose(matrix)
print("Transpose of Matrix:\n", transpose)
