The dplyr Package in Detail
The dplyr package in the R programming language is a powerful tool for data manipulation. It provides a streamlined set of functions (or verbs) to handle common data manipulation tasks efficiently and intuitively.
Key Benefits of dplyr
- Simplifies data manipulation by offering a set of well-defined functions.
- Speeds up development by enabling concise and readable code.
- Runs efficiently on in-memory data, and the same verbs can be translated to other backends (for example, databases via dbplyr or data.table via dtplyr).
Data Frames and Tibbles
Data Frames: Data frames in R are structured tables in which each column holds values of a single type, such as character (names) or numeric (ages, scores). You can create a data frame using the following code:
# Create a data frame
students <- data.frame(
  Name = c("Amit", "Priya", "Rohan"),
  Age = c(20, 21, 19),
  Score = c(88, 92, 85)
)
print(students)
Output:
   Name Age Score
1  Amit  20    88
2 Priya  21    92
3 Rohan  19    85
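To see the "one type per column" rule in practice, here is a minimal sketch that rebuilds the same data frame and inspects each column with base R's sapply() and str(). Note that the class reported for Name depends on your R version: character in R 4.0 and later, factor in older versions.

```r
# Rebuild the example data frame
students <- data.frame(
  Name  = c("Amit", "Priya", "Rohan"),
  Age   = c(20, 21, 19),
  Score = c(88, 92, 85)
)

# Report the class of every column as a named character vector
col_types <- sapply(students, class)
print(col_types)

# str() gives a compact overview: dimensions, column types, first values
str(students)
```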
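Tibbles: Tibbles, provided by the tibble package, are a modern take on data frames. They print only the first ten rows along with each column's type, never change the types or names of their inputs, and do not partially match column names. You can create a tibble as follows: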
# Load tibble library
library(tibble)
# Create a tibble
students_tibble <- tibble(
  Name = c("Amit", "Priya", "Rohan"),
  Age = c(20, 21, 19),
  Score = c(88, 92, 85)
)
print(students_tibble)
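A short sketch of how a tibble and a data frame differ in everyday use may help here. The objects df and tb are illustrative names, not part of the example above.

```r
library(tibble)

df <- data.frame(Name = c("Amit", "Priya"), Score = c(88, 92))
tb <- tibble(Name = c("Amit", "Priya"), Score = c(88, 92))

# A tibble is still a data frame, with extra classes layered on top
print(class(tb))

# Selecting one column with [ , ]: a data frame drops to a plain vector,
# while a tibble stays a tibble
df_col <- df[, "Score"]
tb_col <- tb[, "Score"]
print(is.data.frame(df_col))
print(is.data.frame(tb_col))

# Convert in either direction
tb_from_df <- as_tibble(df)
df_from_tb <- as.data.frame(tb)
```

Because a tibble inherits from data.frame, it can be passed to most functions that expect a data frame.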
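Pipes (%>%): The pipe operator (%>%) comes from the magrittr package and is re-exported by dplyr. It passes the result on its left as the first argument to the function on its right, so multiple operations can be chained in reading order.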
# Load dplyr library
library(dplyr)
# Use pipes to filter, select, group, and summarize data
result <- mtcars %>%
  filter(mpg > 25) %>%            # Filter rows where mpg is greater than 25
  select(mpg, cyl, hp) %>%        # Select specific columns
  group_by(cyl) %>%               # Group data by the 'cyl' variable
  summarise(mean_hp = mean(hp))   # Calculate mean horsepower for each group
print(result)
Output:
# A tibble: 1 × 2
    cyl mean_hp
  <dbl>   <dbl>
1     4    75.5
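To make clear what the pipe does, the sketch below writes a similar query twice: once as nested function calls and once with %>%. The variable names nested and piped are illustrative.

```r
library(dplyr)

# Nested form: has to be read from the inside out
nested <- summarise(
  group_by(filter(mtcars, mpg > 25), cyl),
  mean_hp = mean(hp)
)

# Piped form: reads top to bottom, one step per line
piped <- mtcars %>%
  filter(mpg > 25) %>%
  group_by(cyl) %>%
  summarise(mean_hp = mean(hp))

# Both forms compute the same summary
print(identical(nested$mean_hp, piped$mean_hp))
```

R 4.1 and later also ship a native pipe, |>, which covers the common case of passing the left-hand side as the first argument.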
Verb Functions in dplyr
1. filter(): Use filter() to select rows based on conditions.
# Create a data frame
data <- data.frame(
  Name = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)
# Filter rows with missing Height values
rows_with_na <- data %>% filter(is.na(Height))
print(rows_with_na)
# Filter rows without missing Height values
rows_without_na <- data %>% filter(!is.na(Height))
print(rows_without_na)
Output:
Rows with missing Height:
   Name Age Height
1 Rahul  25     NA
2 Meera  24     NA
Rows without missing Height:
    Name Age Height
1  Anita  28    5.4
2 Sanjay  30    5.9
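filter() also accepts several conditions at once. The following sketch, which rebuilds the same data frame so it runs on its own, shows AND, OR, and how comparisons against NA behave.

```r
library(dplyr)

data <- data.frame(
  Name   = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age    = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

# Conditions separated by commas are combined with AND
older_and_measured <- data %>% filter(Age > 25, !is.na(Height))
print(older_and_measured)

# Use | for OR
youngest_or_oldest <- data %>% filter(Age == 24 | Age == 30)
print(youngest_or_oldest)

# A comparison with NA evaluates to NA, and filter() drops such rows,
# so only rows with a known Height above 5.5 remain
tall <- data %>% filter(Height > 5.5)
print(tall)
```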
2. arrange(): Use arrange() to reorder rows based on column values.
# Arrange data by Age in ascending order
sorted_data <- data %>% arrange(Age)
print(sorted_data)
Output:
    Name Age Height
1  Meera  24     NA
2  Rahul  25     NA
3  Anita  28    5.4
4 Sanjay  30    5.9
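For descending order or tie-breaking, arrange() can be combined with desc() and additional columns. A minimal sketch on the same data:

```r
library(dplyr)

data <- data.frame(
  Name   = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age    = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

# Sort by Age, largest first
oldest_first <- data %>% arrange(desc(Age))
print(oldest_first)

# Sort by Height (descending), breaking ties by Age (ascending).
# Missing values are placed at the end, even with desc().
by_height <- data %>% arrange(desc(Height), Age)
print(by_height)
```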
3. select() and rename(): Use select() to keep only the columns you name, and rename() to change column names using the form new_name = old_name while keeping every column.
# Select specific columns
selected_columns <- data %>% select(Name, Age)
print(selected_columns)
# Rename columns
renamed_data <- data %>% rename(FullName = Name, Years = Age)
print(renamed_data)
Output:
Selected Columns:
    Name Age
1  Anita  28
2  Rahul  25
3 Sanjay  30
4  Meera  24
Renamed Columns:
  FullName Years Height
1    Anita    28    5.4
2    Rahul    25     NA
3   Sanjay    30    5.9
4    Meera    24     NA
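select() supports more than plain column names. The sketch below shows a few common patterns; it assumes dplyr 1.0.0 or later for where().

```r
library(dplyr)

data <- data.frame(
  Name   = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age    = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

# Drop a column with a minus sign
without_height <- data %>% select(-Height)

# Pick columns by name pattern
a_columns <- data %>% select(starts_with("A"))

# Pick columns by type
numeric_columns <- data %>% select(where(is.numeric))

# Rename while selecting (new_name = old_name)
renamed_on_select <- data %>% select(FullName = Name, Age)

print(names(without_height))
print(names(a_columns))
print(names(numeric_columns))
print(names(renamed_on_select))
```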
4. mutate() and transmute(): Use mutate() to add new columns while retaining existing ones. Use transmute() to create new columns and drop all others. In recent dplyr versions transmute() is superseded by mutate(.keep = "none"), but it still works. The column computed below is an illustrative height-to-age ratio, not a real body mass index, which would require weight.
# Add a new column (mutate)
mutated_data <- data %>% mutate(HeightAgeRatio = round((Height * 10) / Age, 2))
print(mutated_data)
# Add a new column and drop others (transmute)
transmuted_data <- data %>% transmute(HeightAgeRatio = round((Height * 10) / Age, 2))
print(transmuted_data)
Output:
Mutated Data:
    Name Age Height HeightAgeRatio
1  Anita  28    5.4           1.93
2  Rahul  25     NA             NA
3 Sanjay  30    5.9           1.97
4  Meera  24     NA             NA
Transmuted Data:
  HeightAgeRatio
1           1.93
2             NA
3           1.97
4             NA
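Within a single mutate() call, later expressions can use columns created earlier in the same call. The sketch below illustrates this, along with the .keep argument (dplyr 1.0.0 or later). It assumes, purely for illustration, that Height is measured in feet; the original data does not state a unit.

```r
library(dplyr)

data <- data.frame(
  Name   = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age    = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

with_metric <- data %>%
  mutate(
    HeightCm = Height * 30.48,   # assumption: Height is in feet
    HeightM  = HeightCm / 100    # uses the column created one line above
  )
print(with_metric)

# Keep only the newly created column, similar to transmute()
only_new <- data %>% mutate(HeightCm = Height * 30.48, .keep = "none")
print(only_new)
```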
5. summarise(): Use summarise() to collapse many rows into a single summary row (or one row per group when combined with group_by()).
# Calculate the average age
average_age <- data %>% summarise(AverageAge = mean(Age))
print(average_age)
Output:
  AverageAge
1      26.75
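summarise() is most useful together with group_by() and helpers such as n(). In the sketch below, the Team column is hypothetical and added only to have something to group by.

```r
library(dplyr)

data <- data.frame(
  Name   = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age    = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA),
  Team   = c("A", "B", "A", "B")   # hypothetical grouping column
)

# Several summaries at once; na.rm = TRUE skips missing heights
# (without it, mean(Height) would be NA)
overall <- data %>%
  summarise(
    Count      = n(),
    MeanHeight = mean(Height, na.rm = TRUE)
  )
print(overall)

# One summary row per team
per_team <- data %>%
  group_by(Team) %>%
  summarise(Count = n(), MeanAge = mean(Age))
print(per_team)
```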
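6. sample_n() and sample_frac(): Use these functions to take random samples of rows, either a fixed number or a fraction. Since dplyr 1.0.0 they are superseded by slice_sample(), but they continue to work.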
# Take 2 random rows
random_rows <- data %>% sample_n(2)
print(random_rows)
# Take 50% of rows randomly
random_fraction <- data %>% sample_frac(0.5)
print(random_fraction)
Output (rows are chosen at random, so your results will vary between runs):
Random Rows:
    Name Age Height
1 Sanjay  30    5.9
2  Meera  24     NA
Random Fraction:
   Name Age Height
1 Anita  28    5.4
2 Rahul  25     NA
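To make a random sample reproducible, set a seed first. Here is a minimal sketch using slice_sample(), the newer counterpart of sample_n() and sample_frac(); it assumes dplyr 1.0.0 or later, and the seed value 42 is arbitrary.

```r
library(dplyr)

data <- data.frame(
  Name   = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age    = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

# Fix the random number generator state so the sample can be repeated
set.seed(42)
two_rows <- data %>% slice_sample(n = 2)

# Sample by proportion instead of by count
half_rows <- data %>% slice_sample(prop = 0.5)

# Resetting the same seed reproduces the same sample
set.seed(42)
two_rows_again <- data %>% slice_sample(n = 2)
print(identical(two_rows, two_rows_again))
```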