Introduction
Factors are a special type of data structure in R used to represent categorical data. Categorical data consists of values that belong to a finite set of categories, such as gender, education level, ratings, or departments.
Factors are widely used in:
- Statistical modeling
- Data analysis
- Machine learning
- Data visualization
What is a Factor?
A factor is a data structure that stores:
- Levels (unique categories)
- Integer codes that represent these levels
Internally, factors are stored as integers, but displayed as labels.
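The two parts of a factor can be inspected directly. The following is a minimal sketch; the vector dept and its values are hypothetical and chosen only for illustration.

```r
# Hypothetical example data: department names with repeats
dept <- factor(c("HR", "IT", "HR", "Sales"))

levels(dept)      # the unique categories: "HR" "IT" "Sales"
as.integer(dept)  # the integer codes: 1 2 1 3
typeof(dept)      # "integer": the underlying storage type
class(dept)       # "factor": the class that makes R display labels
```

Each code is simply the position of the value's label in levels(dept).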
Why Factors are Important
Factors help R:
- Understand categorical variables
- Apply correct statistical methods
- Store repeated category labels compactly as integer codes
- Handle ordering of categories properly
Example:
- Gender: Male, Female
- Rating: Low, Medium, High
Creating Factors in R
Using factor() Function
gender <- factor(c("Male", "Female", "Male", "Female"))
print(gender)
Levels of a Factor
Levels are the unique categories in a factor.
levels(gender)
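Levels belong to the factor itself, not just to the values currently present. As a small sketch (gender2 and the extra category "Other" are assumptions made for illustration), a level can be declared even when no observation uses it:

```r
gender <- factor(c("Male", "Female", "Male", "Female"))
nlevels(gender)   # number of levels: 2
table(gender)     # count of observations per level

# "Other" is a hypothetical category that is declared but never observed
gender2 <- factor(c("Male", "Female"),
                  levels = c("Female", "Male", "Other"))
table(gender2)    # "Other" is still reported, with a count of 0
```

This is useful when a report or plot should show every possible category, including empty ones.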
Level Ordering of Factors
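By default, levels are sorted alphabetically, not in the order in which the values appear. In the example below the levels therefore come out as High, Low, Medium.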
rating <- factor(c("Low", "High", "Medium"))
levels(rating)
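When the alphabetical default is not wanted, the levels argument of factor() sets the order explicitly. A minimal sketch follows; the name rating_fixed is introduced here only for illustration.

```r
rating <- factor(c("Low", "High", "Medium"))
levels(rating)        # "High" "Low" "Medium" (alphabetical default)

# Supplying levels fixes the order used by levels(), table() and sort()
rating_fixed <- factor(c("Low", "High", "Medium"),
                       levels = c("Low", "Medium", "High"))
levels(rating_fixed)  # "Low" "Medium" "High"
sort(rating_fixed)    # values sorted by level order: Low Medium High
```

Note that this only changes the order of the levels; the factor is still unordered, so comparisons such as < are not meaningful on it.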
Ordered Factors
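Ordered factors record that the levels have a meaningful rank, which allows comparisons such as < and >. Set ordered = TRUE and list the levels from lowest to highest.
rating <- factor(
  c("Low", "Medium", "High"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)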
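Once a factor is ordered, its values can be compared and ranked. The sketch below rebuilds the same rating factor so that it runs on its own.

```r
rating <- factor(c("Low", "Medium", "High"),
                 levels = c("Low", "Medium", "High"),
                 ordered = TRUE)

print(rating)               # levels are shown as Low < Medium < High
rating < "High"             # TRUE TRUE FALSE
max(rating)                 # High
rating[rating >= "Medium"]  # keeps Medium and High
```

On an unordered factor the same comparisons return NA with a warning, so ordering has to be declared explicitly.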
Checking Factor Properties
is.factor()
is.factor(rating)
is.ordered()
is.ordered(rating)
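To make the difference between the two checks concrete, here is a short sketch that applies them to a character vector, an unordered factor, and an ordered factor. The names x, f, and o are illustrative only.

```r
x <- c("Low", "High")                        # plain character vector
f <- factor(x)                               # unordered factor
o <- factor(x, levels = c("Low", "High"),
            ordered = TRUE)                  # ordered factor

is.factor(x)   # FALSE
is.factor(f)   # TRUE
is.factor(o)   # TRUE: an ordered factor is still a factor
is.ordered(f)  # FALSE
is.ordered(o)  # TRUE
class(o)       # "ordered" "factor"
```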
Converting Data to Factors
Convert Vector to Factor
x <- c("Yes", "No", "Yes")
f <- as.factor(x)
Convert Factor to Character
as.character(f)
Convert Factor to Numeric
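⚠️ Calling as.numeric() directly on a factor returns the internal integer codes, not the labels. To recover numeric labels, convert through the levels. This only makes sense when the labels are numbers, so the example uses a separate factor rather than f:
n <- factor(c("10", "20", "10"))
as.numeric(levels(n))[n]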
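To see why this conversion needs care, the sketch below compares the direct conversion with two safe alternatives. The factor temps and its values are hypothetical.

```r
# Hypothetical readings that were stored as factor labels
temps <- factor(c("30", "10", "20", "10"))

as.numeric(temps)                 # 3 1 2 1: internal codes, not the readings
as.numeric(as.character(temps))   # 30 10 20 10: convert via the labels
as.numeric(levels(temps))[temps]  # 30 10 20 10: same result via the levels
```

The last form converts each distinct level once and then indexes by the codes, which can be slightly more efficient on long vectors with few levels.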
Modifying Factor Levels
Renaming Levels
# Levels are replaced by position; the current order is "No", "Yes"
levels(f) <- c("NO", "YES")
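Because positional renaming depends on knowing the current level order, it can be safer to rename by label. The following sketch shows two ways to do so in base R, starting from a fresh copy of f.

```r
f <- factor(c("Yes", "No", "Yes"))
levels(f)                              # "No" "Yes"

# Rename a single level by matching its current label
levels(f)[levels(f) == "No"] <- "NO"

# Rename several levels at once with a named list (new name = old label)
levels(f) <- list(YES = "Yes", NO = "NO")

levels(f)                              # "YES" "NO"
as.character(f)                        # "YES" "NO" "YES"
```

With the named-list form, the new level order follows the order of the list.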
Adding New Levels
levels(f) <- c(levels(f), "MAYBE")
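A new level has to exist before a value can be assigned to it. As a minimal, self-contained sketch (the label "Maybe" is illustrative), the steps are: add the level, assign the value, and optionally drop levels that are no longer used.

```r
f <- factor(c("Yes", "No", "Yes"))

levels(f) <- c(levels(f), "Maybe")  # declare the new category first
f[2] <- "Maybe"                     # the assignment is now valid
table(f)                            # No: 0, Yes: 2, Maybe: 1

f <- droplevels(f)                  # remove levels with no observations
levels(f)                           # "Yes" "Maybe"
```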
Key Points about Factors
- Factors represent categorical data
- They store values as integers with labels
- Ordered factors represent ranked categories
- Essential for statistical analysis and modeling
Common Mistakes with Factors
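- Converting a factor directly to numeric, which returns the integer codes instead of the labels
- Forgetting to define the level order, so the alphabetical default is used
- Treating factors as strings, for example assigning a value that is not an existing level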
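The last mistake can be reproduced in a few lines. This is a sketch under the assumption that the value being assigned ("Maybe") was never declared as a level; the names g and s are illustrative.

```r
f <- factor(c("Yes", "No", "Yes"))

g <- f
g[1] <- "Maybe"   # "Maybe" is not a level: R issues a warning and stores NA
is.na(g[1])       # TRUE

# Converting to character first avoids the problem
s <- as.character(f)
s[1] <- "Maybe"
s                 # "Maybe" "No" "Yes"
```

If the result should remain a factor, the alternative is to add the new level before assigning to it.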
Summary
Factors are a core data structure in R used for categorical data. They play a critical role in statistical modeling and data analysis by ensuring that categorical variables are handled correctly and efficiently.