Principal Component Analysis in detail
Principal Component Analysis (PCA) is a technique for summarizing the variation in a dataset using a smaller set of linear components. Principal components are linear combinations (an orthogonal transformation) of the original predictors in the dataset. PCA is widely used in Exploratory Data Analysis (EDA) because it helps visualize the variation present in high-dimensional data.
Understanding PCA
The first principal component captures the maximum variance in the dataset and points in the direction of highest variability. The second principal component captures as much of the remaining variance as possible while being uncorrelated with the first component (PC1). This pattern continues for all succeeding principal components: each captures the maximum remaining variance while staying uncorrelated with every previous component.
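Concretely, for standardized data the components are the eigenvectors of the correlation matrix, ordered by eigenvalue. A minimal sketch (assuming the built-in iris data, and equivalent to prcomp with scale = TRUE):

```r
# PCA via eigen-decomposition of the correlation matrix
X <- scale(iris[, 1:4])      # center and scale the numeric columns
e <- eigen(cov(X))           # cov of scaled data = correlation matrix
e$values                     # variance captured by PC1..PC4, in decreasing order
e$values / sum(e$values)     # proportion of total variance per component
scores <- X %*% e$vectors    # principal component scores (one column per PC)
```

The eigenvectors are the loadings and the eigenvalues are the component variances, which is why each successive component explains no more variance than the one before it.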
Dataset
We will use the iris dataset, which is built into R. It contains measurements of sepal length, sepal width, petal length, and petal width for three different species of flowers.
Installing Required Packages
install.packages("dplyr")
Loading the Package and Dataset
library(dplyr)
data(iris)
str(iris)
Output:
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 ...
Principal Component Analysis with R
We perform Principal Component Analysis (PCA) on the iris dataset, using its four numeric measurements across all 150 observations.
# Load dataset
data(iris)
# Remove non-numeric column
iris_numeric <- iris[, -5]
# Apply PCA using prcomp function
my_pca <- prcomp(iris_numeric, scale = TRUE, center = TRUE, retx = TRUE)
# View summary
summary(my_pca)
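# Expected output (approximately, for the scaled iris data):
#   Standard deviations:    1.708 0.956 0.383 0.144
#   Proportion of Variance: 0.730 0.228 0.037 0.005
#   Cumulative Proportion:  0.730 0.958 0.995 1.000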
# View principal component loadings
my_pca$rotation
# View transformed principal components
dim(my_pca$x)
my_pca$x
# Plot the resultant principal components
biplot(my_pca, main = "Biplot", scale = 0)
# Compute variance and proportion of variance explained
my_pca.var <- my_pca$sdev^2
propve <- my_pca.var / sum(my_pca.var)
# Scree plot
plot(propve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b", main = "Scree Plot")
# Cumulative variance plot
plot(cumsum(propve),
xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
# Find the number of components covering at least 90% variance
# (for iris, PC1 and PC2 together explain about 95.8%, so this returns 2)
which(cumsum(propve) >= 0.9)[1]
# Prepare data for Decision Tree
train.data <- data.frame(Sepal.Length = iris$Sepal.Length, my_pca$x[, 1:4])
# Install and load decision tree packages
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
# Build Decision Tree model
rpart.model <- rpart(Sepal.Length ~ ., data = train.data, method = "anova")
# Plot the Decision Tree
rpart.plot(rpart.model)
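To use the fitted tree on new observations, they must first be projected onto the same principal components. A small sketch, assuming my_pca and rpart.model from the steps above are still in the workspace:

```r
# Project new flowers onto the fitted PCs, then predict with the tree
new_obs <- iris[1:3, -5]                          # three sample flowers
new_scores <- predict(my_pca, newdata = new_obs)  # PC scores for new data
predict(rpart.model, data.frame(new_scores))      # predicted Sepal.Length
```

predict() on a prcomp object applies the same centering and scaling used when the PCA was fitted, which is essential for the scores to be comparable.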
Output:

Figure: Scree plot of the variance explained by each principal component
Figure: Cumulative proportion of variance explained
Figure: Decision tree model