dplyr Package in R Programming

dplyr Package in detail

The dplyr package in the R programming language is a powerful tool for data manipulation. It provides a streamlined set of functions (or verbs) to handle common data manipulation tasks efficiently and intuitively.

Key Benefits of dplyr

  • Simplifies data manipulation by offering a set of well-defined functions.
  • Speeds up development by enabling concise and readable code.
  • Can reduce computation time, because the same verbs also run on optimized backends such as databases (through dbplyr) and data.table (through dtplyr).

Data Frames and Tibbles

Data Frames: Data frames in R are structured tables where each column holds data of a specific type, such as names, ages, or scores. You can create a data frame using the following code:

# Create a data frame
students <- data.frame(
  Name = c("Amit", "Priya", "Rohan"),
  Age = c(20, 21, 19),
  Score = c(88, 92, 85)
)
print(students)

Output:

   Name Age Score
1  Amit  20    88
2 Priya  21    92
3 Rohan  19    85

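Tibbles: Tibbles, introduced by the tibble package, are a modern take on data frames. They print only the first rows along with each column's type, never convert strings to factors, and do not partially match column names. You can create a tibble as follows: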

# Load tibble library
library(tibble)

# Create a tibble
students_tibble <- tibble(
  Name = c("Amit", "Priya", "Rohan"),
  Age = c(20, 21, 19),
  Score = c(88, 92, 85)
)
print(students_tibble)
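
To make the difference between the two structures concrete, here is a minimal sketch that converts the data frame from the earlier example into a tibble and compares how they behave. The name students_tbl is only an illustrative choice.

```r
library(tibble)

# Same data as above, first as a base data frame
students <- data.frame(
  Name = c("Amit", "Priya", "Rohan"),
  Age = c(20, 21, 19),
  Score = c(88, 92, 85)
)

# Convert the data frame to a tibble
students_tbl <- as_tibble(students)

# A tibble is still a data frame, with extra classes on top
class(students_tbl)       # "tbl_df" "tbl" "data.frame"
is_tibble(students)       # FALSE
is_tibble(students_tbl)   # TRUE

# Selecting one column with [ , ]: a data frame drops to a vector,
# while a tibble stays a tibble
is.data.frame(students[, "Age"])       # FALSE
is.data.frame(students_tbl[, "Age"])   # TRUE
```

Because a tibble inherits from data.frame, it can be passed to functions that expect a regular data frame.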

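Pipes (%>%): The pipe operator (%>%), which dplyr re-exports from the magrittr package, passes the result on its left as the first argument of the function on its right. This lets you chain multiple operations together for improved code readability.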

# Load dplyr library
library(dplyr)

# Use pipes to filter, select, group, and summarize data
result <- mtcars %>%
  filter(mpg > 25) %>%       # Filter rows where mpg is greater than 25
  select(mpg, cyl, hp) %>%   # Select specific columns
  group_by(cyl) %>%          # Group data by the 'cyl' variable
  summarise(mean_hp = mean(hp))  # Calculate mean horsepower for each group

print(result)

Output:

# A tibble: 1 × 2
    cyl mean_hp
  <dbl>   <dbl>
1     4    75.5
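
To show what the pipe actually does, the following sketch writes the same computation three ways: with %>%, as nested function calls, and with base R's native pipe. The native pipe (|>) is assumed to be available, which requires R 4.1 or later.

```r
library(dplyr)

# Piped version: read top to bottom
piped <- mtcars %>%
  filter(mpg > 25) %>%
  summarise(mean_hp = mean(hp))

# Same computation as nested calls: read inside-out
nested <- summarise(filter(mtcars, mpg > 25), mean_hp = mean(hp))

# Base R's native pipe (assumes R 4.1 or later)
native <- mtcars |>
  filter(mpg > 25) |>
  summarise(mean_hp = mean(hp))

identical(piped, nested)   # TRUE
identical(piped, native)   # TRUE
```

All three forms give the same result; the piped versions simply keep the steps in the order they are applied.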

Verb Functions in dplyr

1. filter(): Use filter() to select rows based on conditions.

# Create a data frame
data <- data.frame(
  Name = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

# Filter rows with missing Height values
rows_with_na <- data %>% filter(is.na(Height))
print(rows_with_na)

# Filter rows without missing Height values
rows_without_na <- data %>% filter(!is.na(Height))
print(rows_without_na)

Output:

Rows with missing Height:
   Name Age Height
1 Rahul  25     NA
2 Meera  24     NA

Rows without missing Height:
    Name Age Height
1  Anita  28    5.4
2 Sanjay  30    5.9
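
filter() also accepts several conditions at once. The sketch below reuses the example data to show AND, OR, and how a condition that evaluates to NA is treated. The variable names are illustrative only.

```r
library(dplyr)

data <- data.frame(
  Name = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

# Comma-separated conditions are combined with AND
older_with_height <- data %>% filter(Age > 25, !is.na(Height))
print(older_with_height)   # Anita and Sanjay

# Use | for OR
young_or_tall <- data %>% filter(Age < 25 | Height > 5.5)
print(young_or_tall)       # Sanjay and Meera

# A condition that evaluates to NA drops the row, so the two rows
# with missing Height are excluded here
tall <- data %>% filter(Height > 5)
print(tall)                # Anita and Sanjay
```

Note that filter() keeps a row only when the condition is TRUE, which is why rows with NA results disappear without any explicit is.na() check.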

2. arrange(): Use arrange() to reorder rows based on column values.

# Arrange data by Age in ascending order
sorted_data <- data %>% arrange(Age)
print(sorted_data)

Output:

    Name Age Height
1  Meera  24     NA
2  Rahul  25     NA
3  Anita  28    5.4
4 Sanjay  30    5.9
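
As a short sketch of two common variations, the code below sorts in descending order with desc() and shows where arrange() places missing values. The variable names are illustrative only.

```r
library(dplyr)

data <- data.frame(
  Name = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

# Sort by Age in descending order
sorted_desc <- data %>% arrange(desc(Age))
print(sorted_desc)      # Sanjay, Anita, Rahul, Meera

# Sort by Height: rows with NA Height are placed last
by_height <- data %>% arrange(Height)
print(by_height)        # Anita, Sanjay, then the two NA rows

# NA rows stay last even when sorting in descending order
by_height_desc <- data %>% arrange(desc(Height))
print(by_height_desc)   # Sanjay, Anita, then the two NA rows
```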

3. select() and rename(): Use select() to choose columns and rename() to rename them.

# Select specific columns
selected_columns <- data %>% select(Name, Age)
print(selected_columns)

# Rename columns
renamed_data <- data %>% rename(FullName = Name, Years = Age)
print(renamed_data)

Output:

Selected Columns:
    Name Age
1  Anita  28
2  Rahul  25
3 Sanjay  30
4  Meera  24

Renamed Columns:
  FullName Years Height
1    Anita    28    5.4
2    Rahul    25     NA
3   Sanjay    30    5.9
4    Meera    24     NA
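
select() supports more than a plain list of names. The following sketch, using the same example data, shows a few common patterns: dropping a column, using a selection helper, renaming while selecting, and reordering. The variable names are illustrative only.

```r
library(dplyr)

data <- data.frame(
  Name = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

# Drop a column with a minus sign
no_height <- data %>% select(-Height)
names(no_height)        # "Name" "Age"

# Selection helper: columns whose names start with "H"
h_cols <- data %>% select(starts_with("H"))
names(h_cols)           # "Height"

# Rename while selecting (columns not listed are dropped)
renamed_subset <- data %>% select(FullName = Name, Age)
names(renamed_subset)   # "FullName" "Age"

# Move Height to the front and keep everything else after it
reordered <- data %>% select(Height, everything())
names(reordered)        # "Height" "Name" "Age"
```

The difference from rename() is that select() keeps only the columns you mention, while rename() keeps all of them.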

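4. mutate() and transmute(): Use mutate() to add new columns while retaining existing ones. Use transmute() to create new columns and drop all others. In dplyr 1.1.0 and later, transmute() is superseded by mutate(.keep = "none"), although it still works.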

# Add a new column (mutate); BMI here is an illustrative Height-to-Age ratio, not a real body mass index
mutated_data <- data %>% mutate(BMI = round((Height * 10) / Age, 2))
print(mutated_data)

# Add a new column and drop others (transmute)
transmuted_data <- data %>% transmute(BMI = round((Height * 10) / Age, 2))
print(transmuted_data)

Output:

Mutated Data:
    Name Age Height  BMI
1  Anita  28    5.4 1.93
2  Rahul  25     NA   NA
3 Sanjay  30    5.9 1.97
4  Meera  24     NA   NA

Transmuted Data:
   BMI
1 1.93
2   NA
3 1.97
4   NA
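
The .keep argument of mutate() offers finer control over which columns survive. This is a sketch that assumes dplyr 1.0.0 or later, where .keep is available; the column name HeightAgeRatio is a hypothetical, more descriptive stand-in for the BMI column above.

```r
library(dplyr)

data <- data.frame(
  Name = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

# .keep = "none" keeps only the newly created column, like transmute()
only_new <- data %>%
  mutate(HeightAgeRatio = round((Height * 10) / Age, 2), .keep = "none")
names(only_new)    # "HeightAgeRatio"

# .keep = "used" keeps the new column plus the columns used to compute it
with_inputs <- data %>%
  mutate(HeightAgeRatio = round((Height * 10) / Age, 2), .keep = "used")
names(with_inputs) # "Age" "Height" "HeightAgeRatio"

# Same values as the transmute() call
via_transmute <- data %>%
  transmute(HeightAgeRatio = round((Height * 10) / Age, 2))
all.equal(only_new, via_transmute)   # TRUE
```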

5. summarise(): Use summarise() to condense multiple values into a single summary.

# Calculate the average age
average_age <- data %>% summarise(AverageAge = mean(Age))
print(average_age)

Output:

  AverageAge
1      26.75
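
summarise() can compute several statistics at once and is most useful together with group_by(). The sketch below reuses the example data; the grouping column HeightMissing is a hypothetical name introduced only for illustration, and the .groups argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

data <- data.frame(
  Name = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

# Several summaries in one call
summary_stats <- data %>%
  summarise(
    Count = n(),                                  # number of rows
    AverageAge = mean(Age),
    AverageHeight = mean(Height, na.rm = TRUE),   # ignore missing heights
    MissingHeights = sum(is.na(Height))
  )
print(summary_stats)

# One summary row per group: average age by whether Height is missing
by_missing <- data %>%
  group_by(HeightMissing = is.na(Height)) %>%
  summarise(AverageAge = mean(Age), .groups = "drop")
print(by_missing)
```

Without na.rm = TRUE, mean(Height) would return NA because the column contains missing values.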

6. sample_n() and sample_frac(): Use these functions to take random samples of rows. Both are superseded in dplyr 1.0.0 and later by slice_sample(), although they still work.

# Take 2 random rows
random_rows <- data %>% sample_n(2)
print(random_rows)

# Take 50% of rows randomly
random_fraction <- data %>% sample_frac(0.5)
print(random_fraction)

Output (rows are sampled at random, so your results will vary between runs):

Random Rows:
    Name Age Height
1 Sanjay  30    5.9
2  Meera  24     NA

Random Fraction:
    Name Age Height
1  Anita  28    5.4
2  Rahul  25     NA
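
For reproducible samples, the random seed can be fixed before sampling. This sketch uses slice_sample(), assuming dplyr 1.0.0 or later; the seed value 42 is arbitrary.

```r
library(dplyr)

data <- data.frame(
  Name = c("Anita", "Rahul", "Sanjay", "Meera"),
  Age = c(28, 25, 30, 24),
  Height = c(5.4, NA, 5.9, NA)
)

# Fix the seed so the same rows are drawn on every run
set.seed(42)
random_rows <- data %>% slice_sample(n = 2)          # 2 random rows
random_fraction <- data %>% slice_sample(prop = 0.5) # 50% of the rows

# Resetting the seed reproduces the first draw exactly
set.seed(42)
random_rows_again <- data %>% slice_sample(n = 2)
identical(random_rows, random_rows_again)   # TRUE
```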
