Principal Component Analysis in detail
Principal Component Analysis (PCA) is a technique for summarizing the variation in a dataset using a smaller set of linear components. Principal components are linear combinations (an orthogonal transformation) of the original predictors in the dataset. PCA is widely used in Exploratory Data Analysis (EDA) because it helps visualize the variation present in high-dimensional data.
Understanding PCA
The first principal component captures the maximum variance in the dataset and points in the direction of highest variability. The second principal component captures as much of the remaining variance as possible while being uncorrelated with the first component (PC1). This pattern continues for all succeeding principal components: each captures the maximum remaining variance while staying uncorrelated with every previous component.
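Concretely, for standardized data the components are the eigenvectors of the correlation matrix, ordered by eigenvalue. A minimal sketch (assuming the built-in iris data, and equivalent to prcomp with scale = TRUE):

```r
# PCA via eigen-decomposition of the correlation matrix
X <- scale(iris[, 1:4])      # center and scale the numeric columns
e <- eigen(cov(X))           # cov of scaled data = correlation matrix
e$values                     # variance captured by PC1..PC4, in decreasing order
e$values / sum(e$values)     # proportion of total variance per component
scores <- X %*% e$vectors    # principal component scores (one column per PC)
```

The eigenvectors are the loadings and the eigenvalues are the component variances, which is why each successive component explains no more variance than the one before it.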
Dataset
We will use the iris dataset, which is built into R. It contains measurements of sepal length, sepal width, petal length, and petal width for three different species of flowers.
Installing Required Packages
install.packages("dplyr")
Loading the Package and Dataset
library(dplyr)
data(iris)
str(iris)
Output:
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 ...
Principal Component Analysis with R
We perform Principal Component Analysis (PCA) on the iris dataset, using its four numeric measurements across all 150 observations.
# Load dataset
data(iris)
# Remove non-numeric column
iris_numeric <- iris[, -5]
# Apply PCA using prcomp function
my_pca <- prcomp(iris_numeric, scale = TRUE, center = TRUE, retx = TRUE)
# View summary
summary(my_pca)
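# Expected output (approximately, for the scaled iris data):
#   Standard deviations:    1.708 0.956 0.383 0.144
#   Proportion of Variance: 0.730 0.228 0.037 0.005
#   Cumulative Proportion:  0.730 0.958 0.995 1.000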
# View principal component loadings
my_pca$rotation
# View transformed principal components
dim(my_pca$x)
my_pca$x
# Plot the resultant principal components
biplot(my_pca, main = "Biplot", scale = 0)
# Compute variance and proportion of variance explained
my_pca.var <- my_pca$sdev^2
propve <- my_pca.var / sum(my_pca.var)
# Scree plot
plot(propve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b", main = "Scree Plot")
# Cumulative variance plot
plot(cumsum(propve),
xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
# Find the number of components covering at least 90% variance
# (for iris, PC1 and PC2 together explain about 95.8%, so this returns 2)
which(cumsum(propve) >= 0.9)[1]
# Prepare data for Decision Tree
train.data <- data.frame(Sepal.Length = iris$Sepal.Length, my_pca$x[, 1:4])
# Install and load decision tree packages
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
# Build Decision Tree model
rpart.model <- rpart(Sepal.Length ~ ., data = train.data, method = "anova")
# Plot the Decision Tree
rpart.plot(rpart.model)
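To use the fitted tree on new observations, they must first be projected onto the same principal components. A small sketch, assuming my_pca and rpart.model from the steps above are still in the workspace:

```r
# Project new flowers onto the fitted PCs, then predict with the tree
new_obs <- iris[1:3, -5]                          # three sample flowers
new_scores <- predict(my_pca, newdata = new_obs)  # PC scores for new data
predict(rpart.model, data.frame(new_scores))      # predicted Sepal.Length
```

predict() on a prcomp object applies the same centering and scaling used when the PCA was fitted, which is essential for the scores to be comparable.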
Output:

Figure: Scree plot of the variance explained by each principal component
Figure: Cumulative proportion of variance explained
Figure: Decision tree model