Data Types and Sources
1. Data Types:
- Structured Data: Organized in a clear, easily searchable format, typically in tables with rows and columns (e.g., databases, spreadsheets).
- Unstructured Data: Lacks a predefined structure, often text-heavy, such as emails, social media posts, images, or videos.
- Semi-Structured Data: Contains elements of both structured and unstructured data, like JSON, XML, or log files.
- Time-Series Data: Data points collected or recorded at specific time intervals, used in financial markets, sensor readings, etc.
- Geospatial Data: Information about physical objects on Earth, often used in maps and GPS systems.
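For instance, here is a minimal sketch contrasting structured and semi-structured data in Python (the field names are invented for illustration):
import json
import pandas as pd
# Structured: fixed rows and columns, directly queryable
customers = pd.DataFrame({'id': [1, 2], 'city': ['Lagos', 'Oslo']})
print(customers[customers['city'] == 'Oslo'])
# Semi-structured: tagged but flexible; fields may nest or be missing
event = json.loads('{"id": 1, "tags": ["promo"], "meta": {"source": "web"}}')
print(event['meta']['source'])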
2. Data Sources:
- Databases: Relational (e.g., MySQL, PostgreSQL) and non-relational (e.g., MongoDB) databases.
- APIs: Interfaces provided by services to access their data programmatically (e.g., Twitter API, Google Maps API).
- Web Scraping: Extracting data from websites using tools like BeautifulSoup or Scrapy.
- Sensors: IoT devices, wearables, and other hardware that collect real-time data.
- Public Datasets: Open data repositories like Kaggle, UCI Machine Learning Repository, or government databases.
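As a quick sketch, a JSON API can be queried programmatically with the requests library (the URL below is a placeholder, not a real endpoint):
import requests
import pandas as pd
# Hypothetical endpoint returning a JSON list of records
url = 'https://api.example.com/v1/measurements'  # placeholder URL
response = requests.get(url, params={'limit': 100}, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors
# Load the records into a DataFrame for analysis
df = pd.DataFrame(response.json())
print(df.head())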
Tensors:
- Definition: A tensor is a generalization of vectors and matrices to higher dimensions. Tensors are used in deep learning, physics, and more complex data representations.
- Notation: Tensors are often denoted by uppercase letters (e.g., T) with indices representing the different dimensions, such as T_{ijk} for a rank-3 tensor.
- Operations: Tensor operations generalize matrix operations to higher dimensions, including addition, multiplication, and contraction.
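As a small illustration, NumPy arrays can represent tensors of any rank, and np.einsum expresses contraction (summing over a shared index):
import numpy as np
# A rank-3 tensor: a 2 x 3 x 4 array
T = np.arange(24).reshape(2, 3, 4)
print("Shape:", T.shape)  # (2, 3, 4)
# Element access uses three indices, T[i, j, k]
print("T[1, 2, 3] =", T[1, 2, 3])
# Contraction with a vector over the shared index k yields a 2 x 3 matrix
v = np.ones(4)
contracted = np.einsum('ijk,k->ij', T, v)
print("Contracted shape:", contracted.shape)  # (2, 3)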
Data Cleaning: Handling Missing Values, Outliers
1. Handling Missing Values:
- Removal:
- Delete Rows: Remove rows with missing values if they constitute a small portion of the data.
- Delete Columns: Remove columns with a significant proportion of missing values.
- Imputation:
- Mean/Median/Mode Imputation: Replace missing values with the mean, median, or mode of the column.
- Forward/Backward Fill: Fill missing values with the previous/next observation in time-series data.
- Interpolation: Estimate missing values based on surrounding data points, particularly in time-series data.
- Advanced Techniques:
- K-Nearest Neighbors (KNN) Imputation: Estimate missing values based on similar rows (see the sketch after this list).
- Multiple Imputation: Generate multiple imputations and average them to handle uncertainty.
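As a brief sketch with toy values, pandas covers forward fill and interpolation, and scikit-learn's KNNImputer fills each gap from the most similar rows:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(s.ffill())        # forward fill: carry the last observation forward
print(s.interpolate())  # linear interpolation between neighbors
# KNN imputation: fill each missing entry using the 2 nearest rows
X = np.array([[25.0, 50000.0], [30.0, 54000.0], [np.nan, 61000.0], [35.0, np.nan]])
print(KNNImputer(n_neighbors=2).fit_transform(X))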
2. Handling Outliers:
- Identification:
- Z-Score: Outliers are data points with Z-scores greater than a certain threshold (e.g., |Z| > 3).
- IQR Method: Points lying below Q1 − 1.5×IQR or above Q3 + 1.5×IQR (where IQR = Q3 − Q1 is the interquartile range) are considered outliers. Both methods are sketched below.
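A minimal sketch of both identification methods on synthetic data (one extreme value planted at 120):
import numpy as np
rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 200), 120.0)  # 200 typical points plus one outlier
# Z-score method: flag points more than 3 standard deviations from the mean
z = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z) > 3])
# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("IQR outliers:", data[mask])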
Feature Engineering: Scaling, Encoding, Selection
Scaling:
- Standardization: Rescale data to have a mean of 0 and a standard deviation of 1.
- Min-Max Scaling: Scale data to a fixed range, typically [0, 1].
- Robust Scaling: Use median and IQR for scaling, which is robust to outliers.
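A minimal sketch comparing the three scalers on a column with an outlier (toy values):
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier
print(StandardScaler().fit_transform(X).ravel())  # mean 0, standard deviation 1
print(MinMaxScaler().fit_transform(X).ravel())    # squeezed into [0, 1] by the outlier
print(RobustScaler().fit_transform(X).ravel())    # centered on the median, scaled by IQR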
Encoding:
- One-Hot Encoding: Convert categorical variables into a series of binary columns.
- Label Encoding: Assign a unique integer to each category.
- Ordinal Encoding: Encode categorical variables where order matters (e.g., “low”, “medium”, “high”).
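A short sketch of one-hot and ordinal encoding (the column names are invented for illustration):
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'color': ['red', 'green', 'blue'], 'size': ['low', 'high', 'medium']})
# One-hot encoding: one binary column per category
print(pd.get_dummies(df['color'], prefix='color'))
# Ordinal encoding with an explicit category order
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
df['size_encoded'] = encoder.fit_transform(df[['size']])
print(df)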
Feature Selection:
- Filter Methods: Select features based on statistical tests like Chi-square or correlation.
- Wrapper Methods: Use algorithms like Recursive Feature Elimination (RFE) to select features.
- Embedded Methods: Feature selection occurs during the training of the model, e.g., Lasso regression.
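A minimal sketch of a filter method and a wrapper method on synthetic data (an embedded method would simply read the nonzero coefficients of a fitted Lasso):
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)
# Filter method: keep the 3 features with the highest ANOVA F-scores
filtered = SelectKBest(f_classif, k=3).fit(X, y)
print("Filter picks:", filtered.get_support(indices=True))
# Wrapper method: recursively drop the weakest feature by model coefficient
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("RFE picks:", rfe.get_support(indices=True))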
Data Splitting: Training, Validation, and Test Sets
Training Set:
- Purpose: The portion of data used to train the model. The model learns patterns and relationships from this dataset.
- Typical Split: 60-80% of the entire dataset.
Validation Set:
- Purpose: Used to tune model parameters and prevent overfitting by evaluating the model’s performance on unseen data during the training process.
- Typical Split: 10-20% of the entire dataset.
Test Set:
- Purpose: Used to evaluate the final model’s performance and generalization ability on completely unseen data.
- Typical Split: 10-20% of the entire dataset.
Example: Data Cleaning and Splitting in Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# Example data
data = {
'Age': [25, 30, 45, None, 35, 50, 28, None],
'Salary': [50000, 54000, 61000, 58000, None, 69000, 72000, 65000],
'Purchased': ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes']
}
# Create DataFrame
df = pd.DataFrame(data)
# 1. Handle missing values (Imputation)
imputer = SimpleImputer(strategy='mean')
df['Age'] = imputer.fit_transform(df[['Age']])
df['Salary'] = imputer.fit_transform(df[['Salary']])
# 2. Encode categorical variables
df['Purchased'] = df['Purchased'].map({'No': 0, 'Yes': 1})
# 3. Scale features
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])
# 4. Split data into training, validation, and test sets
X = df[['Age', 'Salary']]
y = df['Purchased']
# Split into train+val and test first
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Then split train+val into train and validation (0.25 of the remaining 80% = 20% of the full dataset, giving a 60/20/20 split)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)
print("Training set size:", len(X_train))
print("Validation set size:", len(X_val))
print("Test set size:", len(X_test))
Example: Vector and Matrix Operations in NumPy
import numpy as np
# 1. Create a vector
vector = np.array([1, 2, 3])
print("Vector:", vector)
# 2. Create a matrix
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Matrix:\n", matrix)
# 3. Perform vector addition
vector2 = np.array([4, 5, 6])
vector_sum = vector + vector2
print("Vector Addition:", vector_sum)
# 4. Perform scalar multiplication
scalar = 3
scalar_mult = scalar * vector
print("Scalar Multiplication:", scalar_mult)
# 5. Perform matrix multiplication
matrix2 = np.array([[1, 2, 1], [2, 1, 2], [1, 2, 1]])
matrix_mult = np.dot(matrix, matrix2)
print("Matrix Multiplication:\n", matrix_mult)
# 6. Compute dot product of two vectors
dot_product = np.dot(vector, vector2)
print("Dot Product of vectors:", dot_product)
# 7. Find the transpose of a matrix
transpose = np.transpose(matrix)
print("Transpose of Matrix:\n", transpose)