The lines() function in R is used to add lines of different types, colors, and widths to an existing plot.
Syntax:
lines(x, y, col, lwd, lty)
Parameters:
x, y: Vectors of coordinates
col: Color of the line
lwd: Width of the line
lty: Type of line
Adding Lines to a Plot using lines() Function
Example 1: Adding a Line to a Scatter Plot
This example demonstrates how to create a scatter plot and add a line to it.
# Creating coordinate vectors
x <- c(2.1, 4.2, 1.5, -2.8, 6.3,
3.1, 4.0, 2.8, 2.6, 2.2, 2.0, 2.8)
y <- c(3.2, 6.5, 2.8, -2.5, 10.5, 4.8,
5.9, 5.1, 3.9, 3.2, 3.4, 4.8)
# Plotting the scatter plot
plot(x, y, cex = 1, pch = 3, xlab = "X-axis",
ylab = "Y-axis", col = "black")
# Creating another set of coordinates for the line
x2 <- c(3.5, 1.0, -1.8, 0.2)
y2 <- c(4.0, 5.2, 3.0, 3.5)
# Adding a red line to the plot
lines(x2, y2, col = "red", lwd = 2, lty = 1)
Output:
Example 2: Connecting Points with lines()
This example shows how to plot a scatter plot and connect the points using lines().
# Creating coordinate vectors
x <- c(2.1, 4.2, 1.5, -2.8, 6.3, 3.1,
4.0, 2.8, 2.6, 2.2, 2.0, 2.8)
y <- c(3.2, 6.5, 2.8, -2.5, 10.5, 4.8,
5.9, 5.1, 3.9, 3.2, 3.4, 4.8)
# Plotting the scatter plot
plot(x, y, cex = 1, pch = 3, xlab = "X-axis",
ylab = "Y-axis", col = "black")
# Connecting points with a red line
lines(x, y, col = "red")
Output:
Example 3: Combining abline() and lines() on One Plot
# Create sample data
x <- seq(-5, 5, length.out = 10)
y <- x^3
# Create a plot of the data
plot(x, y, main = "Adding Lines to a Plot", col = "blue")
# Add a vertical line at x = 0
abline(v = 0, col = "green", lwd = 2)
# Add a horizontal line at y = 0
abline(h = 0, col = "purple", lwd = 2)
# Add a diagonal line with slope -2 and intercept 3
abline(a = 3, b = -2, col = "orange", lty = 2, lwd = 2)
# Add a custom line using lines() function
x2 <- seq(-5, 5, length.out = 10)
y2 <- -x2^2 + 4
lines(x2, y2, col = "red", lty = 2, lwd = 2)
The abline() function in R is used to add one or more straight lines to a graph. It can be used to add vertical, horizontal, or regression lines to a plot.
Syntax:
abline(a=NULL, b=NULL, h=NULL, v=NULL, ...)
Parameters:
a, b: Specifies the intercept and the slope of the line.
h: Specifies y-value(s) for horizontal line(s).
v: Specifies x-value(s) for vertical line(s).
Returns:
abline() is called for its side effect: it draws the requested straight line(s) on the current plot.
Example 1: Adding a Vertical Line to the Plot
# Create scatter plot
plot(pressure)
# Add vertical line at x = 200
abline(v = 200, col = "blue")
Output:
Example 2: Adding a Horizontal Line to the Plot
# Create scatter plot
plot(pressure)
# Add horizontal line at y = 300
abline(h = 300, col = "red")
Output:
Example 3: Adding a Regression Line
par(mgp = c(2, 1, 0), mar = c(3, 3, 1, 1))
# Fit regression line
reg <- lm(pressure ~ temperature, data = pressure)
coeff <- coefficients(reg)
# Equation of the line
eq <- paste0("y = ", round(coeff[1], 1), " + ", round(coeff[2], 1), "*x")
# Plot
plot(pressure, main = eq)
abline(reg, col = "darkgreen")
A line graph is a chart that displays information as a series of data points connected by line segments. It is commonly used to show how a value changes over time. Line graphs are created by plotting points at their X and Y coordinates and joining them in order from first to last, so the line rises and falls with the variable being measured.
Creating Line Graphs in R
The plot() function in R is used to create line graphs.
Syntax:
plot(v, type, col, xlab, ylab, main)
Bar Plot (Bar Chart)
A bar plot in R represents the values in a data vector as the heights of bars and is drawn with barplot() rather than plot(). The data values are mapped to the y-axis, and the categories can be labeled on the x-axis. Passing the output of table() instead of a raw data vector plots the frequency of each category, which makes the bar chart resemble a histogram.
Parameters of plot() for line graphs:
v: A numeric vector representing the data points.
type: Specifies the type of graph:
"p" : Draws only points.
"l" : Draws only lines.
"o" : Draws both points and lines.
xlab: Label for the X-axis.
ylab: Label for the Y-axis.
main: Title of the chart.
col: Specifies colors for the points and lines.
Example 1: Creating a Simple Line Graph
This example creates a simple line graph using the type = "o" parameter to show both points and lines.
Code:
# Create the data for the chart.
sales <- c(10, 15, 22, 18, 30)
# Plot the line graph.
plot(sales, type = "o")
Output:
Example 2: Adding Title, Color, and Labels in a Line Graph
To enhance readability, we can add a title, axis labels, and color to the graph.
Code:
# Create the data for the chart.
sales <- c(10, 15, 22, 18, 30)
# Plot the line graph with title and labels.
plot(sales, type = "o", col = "blue",
xlab = "Month", ylab = "Sales (in units)",
main = "Monthly Sales Chart")
Output:
To compare multiple datasets, we can plot multiple lines on the same graph using the lines() function.
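A minimal sketch of this idea follows. The first series reuses the sales values from the earlier examples; the second series (product_b) and the legend labels are hypothetical values added purely for illustration.

```r
# First series: the sales values used in the earlier examples
product_a <- c(10, 15, 22, 18, 30)
# Second series: hypothetical values, for illustration only
product_b <- c(12, 9, 17, 25, 21)

# Compute a shared y-range so that neither series is clipped
y_range <- range(product_a, product_b)

# Draw the first series, then overlay the second with lines()
plot(product_a, type = "o", col = "blue", ylim = y_range,
     xlab = "Month", ylab = "Sales (in units)",
     main = "Monthly Sales Comparison")
lines(product_b, type = "o", col = "red", lty = 2)

# Identify each series in a legend
legend("topleft", legend = c("Product A", "Product B"),
       col = c("blue", "red"), lty = c(1, 2), pch = 1)
```

Setting ylim from both series before the first plot() call matters because lines() never rescales the axes of an existing plot.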
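Saving a Pie Chart to a File
The pie() function draws a pie chart. Wrapping the call between png() and dev.off() writes the chart to an image file instead of the screen.
Code: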
# Defining a vector with counts of different fruits
counts <- c(120, 300, 150, 80, 45, 95)
# Defining labels for each segment
names(counts) <- c("Apples", "Bananas", "Oranges", "Grapes", "Mangoes", "Pineapples")
# Output to be saved as PNG file
png(file = "piechart.png")
# Creating pie chart
pie(counts, labels = names(counts), col = "lightblue",
main = "Fruit Distribution", radius = -1,
col.main = "black")
# Saving the file
dev.off()
ggplot2 is a free, open-source visualization package for the R programming language, built on the Grammar of Graphics. Created by Hadley Wickham, it is one of the most widely used and powerful tools for data visualization in R.
Key Layers of ggplot2
The ggplot2 package operates on several layers, which include:
Data: The dataset used for visualization.
Aesthetics: Mapping data attributes to visual properties such as x-axis, y-axis, color, fill, size, labels, alpha, shape, line width, and line type.
Geometric Objects: How data is represented visually, such as points, lines, histograms, bars, or boxplots.
Facets: Splitting data into subsets displayed in separate panels using rows or columns.
Statistics: Applying transformations like binning, smoothing, or descriptive summaries.
Coordinates: Mapping data points to specific spaces (e.g., Cartesian, fixed, polar) and adjusting limits.
Themes: Customizing non-data elements like font size, background, and color.
Dataset Used: mtcars
The mtcars dataset contains fuel consumption and 10 other automobile design and performance attributes for 32 cars. It comes pre-installed with the R environment.
Viewing the First Few Records
# Print the first 6 records of the dataset
head(mtcars)
# Summary statistics for every column (summary() is part of base R, so no extra package is needed)
summary(mtcars)
Output:
Variable  Min    1st Quartile  Median  Mean    3rd Quartile  Max
mpg       10.4   15.43         19.20   20.09   22.80         33.90
cyl       4.0    4.0           6.0     6.19    8.0           8.0
disp      71.1   120.8         196.3   230.7   326.0         472.0
hp        52.0   96.5          123.0   146.7   180.0         335.0
drat      2.76   3.08          3.70    3.60    3.92          4.93
wt        1.51   2.58          3.32    3.22    3.61          5.42
qsec      14.5   16.89         17.71   17.85   18.90         22.90
vs        0.0    0.0           0.0     0.44    1.0           1.0
am        0.0    0.0           0.0     0.41    1.0           1.0
gear      3.0    3.0           4.0     3.69    4.0           5.0
carb      1.0    2.0           2.0     2.81    4.0           8.0
Visualizing Data with ggplot2
Data Layer: The data layer specifies the dataset to visualize.
# Load ggplot2 and define the data layer
library(ggplot2)
ggplot(data = mtcars) +
labs(title = "Visualization of MTCars Data")
Output:
Aesthetic Layer: Mapping data to visual attributes such as axes, color, or shape.
# Add aesthetics
ggplot(data = mtcars, aes(x = hp, y = mpg, col = disp)) +
labs(title = "Horsepower vs Miles per Gallon")
Output:
Geometric Layer: Adding geometric shapes to display the data.
# Plot data using points
plot1 <- ggplot(data = mtcars, aes(x = hp, y = mpg, col = disp)) +
  geom_point() +
  labs(title = "Horsepower vs Miles per Gallon", x = "Horsepower", y = "Miles per Gallon")
# Display the stored plot
plot1
Output:
Faceting: Create separate plots for subsets of data.
# Facet by transmission type (am: 0 = automatic, 1 = manual)
facet_plot <- ggplot(data = mtcars, aes(x = hp, y = mpg, shape = factor(cyl))) +
  geom_point() +
  facet_grid(. ~ am)
# Display the faceted plot
facet_plot
Output:
Statistics Layer: The statistics layer in ggplot2 allows you to transform your data by applying methods like binning, smoothing, or descriptive statistics.
# Scatter plot with a regression line
ggplot(data = mtcars, aes(x = hp, y = mpg)) +
geom_point() +
stat_smooth(method = lm, col = "blue") +
labs(title = "Relationship Between Horsepower and Miles per Gallon")
Output:
Coordinates Layer: In this layer, data coordinates are mapped to the plot’s visual space. Adjustments to axes, zooming, and proportional scaling of the plot can also be made here.
# Scatter plot with controlled axis limits
ggplot(data = mtcars, aes(x = wt, y = mpg)) +
geom_point() +
stat_smooth(method = lm, col = "green") +
scale_y_continuous("Miles per Gallon", limits = c(5, 35), expand = c(0, 0)) +
scale_x_continuous("Weight", limits = c(1, 6), expand = c(0, 0)) +
coord_equal() +
labs(title = "Effect of Weight on Fuel Efficiency")
Output:
Using coord_cartesian() to Zoom In
# Zoom into specific x-axis and y-axis ranges
ggplot(data = mtcars, aes(x = wt, y = hp, col = as.factor(am))) +
geom_point() +
geom_smooth() +
coord_cartesian(xlim = c(3, 5), ylim = c(100, 300)) +
labs(title = "Zoomed View: Horsepower vs Weight",
x = "Weight",
y = "Horsepower",
color = "Transmission")
Output:
Theme Layer: The theme layer in ggplot2 allows fine control over display elements like background color, font size, and overall styling.
Example 1: Customizing the Background with element_rect()
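The code for this example is not shown in the text, so the following is only a sketch of how element_rect() is typically passed to theme(). The fill and border colors, the title size, and the variable name theme_plot are illustrative choices, not values taken from the original.

```r
library(ggplot2)

# Scatter plot with a customized plot and panel background
theme_plot <- ggplot(data = mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Horsepower vs Miles per Gallon") +
  theme(
    # Area around the panel: light fill with a grey border
    plot.background  = element_rect(fill = "lightyellow", colour = "grey50"),
    # Area behind the data points
    panel.background = element_rect(fill = "white"),
    # Title styling uses element_text() rather than element_rect()
    plot.title       = element_text(size = 14, face = "bold")
  )

# Display the plot
theme_plot
```

Rectangular elements such as backgrounds and borders take element_rect(), while text elements such as titles and axis labels take element_text().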
Contour Plot for the mtcars Dataset: Create a density contour plot to visualize the relationship between two continuous variables.
# 2D density contour plot
ggplot(mtcars, aes(x = wt, y = mpg)) +
stat_density_2d(aes(fill = after_stat(level)), geom = "polygon", color = "black") +
scale_fill_viridis_c() +
labs(title = "2D Density Contour: Weight vs MPG",
x = "Weight",
y = "Miles per Gallon",
fill = "Density Levels") +
theme_minimal()
Output:
Creating a Panel of Plots: Create multiple plots and arrange them in a grid for side-by-side visualization.
library(gridExtra)
# Histograms for selected variables
hist_plot_mpg <- ggplot(mtcars, aes(x = mpg)) +
geom_histogram(binwidth = 2, fill = "steelblue", color = "black") +
labs(title = "Miles per Gallon Distribution", x = "MPG", y = "Frequency")
hist_plot_disp <- ggplot(mtcars, aes(x = disp)) +
geom_histogram(binwidth = 50, fill = "darkred", color = "black") +
labs(title = "Displacement Distribution", x = "Displacement", y = "Frequency")
hist_plot_hp <- ggplot(mtcars, aes(x = hp)) +
geom_histogram(binwidth = 20, fill = "forestgreen", color = "black") +
labs(title = "Horsepower Distribution", x = "Horsepower", y = "Frequency")
hist_plot_drat <- ggplot(mtcars, aes(x = drat)) +
geom_histogram(binwidth = 0.5, fill = "orange", color = "black") +
labs(title = "Drat Distribution", x = "Drat", y = "Frequency")
# Arrange plots in a 2x2 grid
grid.arrange(hist_plot_mpg, hist_plot_disp, hist_plot_hp, hist_plot_drat, ncol = 2)
Output:
Saving and Extracting Plots
To save plots as image files or reuse them later:
# Create a plot (named hp_mpg_plot to avoid masking the base plot() function)
hp_mpg_plot <- ggplot(data = mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Horsepower vs MPG")
# Save the plot as PNG
ggsave("horsepower_vs_mpg.png", hp_mpg_plot)
# Save the plot as PDF
ggsave("horsepower_vs_mpg.pdf", hp_mpg_plot)
# The plot object can be stored under another name and displayed again later
extracted_plot <- hp_mpg_plot
extracted_plot
Data Visualization is the process of converting raw data into visual representations such as graphs, charts, and plots so that information can be understood quickly and clearly. Humans understand visuals far more efficiently than tables of numbers, which makes visualization a critical step in data analysis.
In R, data visualization is one of the strongest features because R was originally designed for statistical analysis and graphical modeling. Visualization is not only used to present final results, but also to explore data, identify trends, patterns, anomalies, and relationships before applying models.
Why Data Visualization is Important
Simplifies complex datasets
Reveals hidden patterns and trends
Helps detect outliers and errors
Improves communication of results
Supports decision-making
Graph Plotting in R
What is Graph Plotting?
Graph plotting refers to creating visual representations of data values using graphical elements such as points, lines, bars, or shapes. In R, graph plotting is mainly done using:
Base R graphics
Advanced systems like ggplot2, lattice
Base R graphics are foundational and widely used for learning concepts.
Generic Plotting System in R
R uses a generic plotting system, where the same function behaves differently based on the data type.
The most important generic function is:
plot()
The plot() function automatically determines:
Type of plot
Axis scaling
Labels (if available)
This behavior is called method dispatch.
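As a rough illustration of this dispatch, the sketch below calls plot() on three different kinds of objects. The vectors x_num and x_fac are hypothetical sample data; mtcars is the built-in dataset.

```r
# Hypothetical sample data of two different classes
x_num <- c(2, 4, 6, 8, 10)                         # numeric vector
x_fac <- factor(c("a", "b", "a", "c", "b", "a"))   # factor

# The class of the argument decides which plot method runs
class(x_num)
class(x_fac)

# Numeric vector: values are plotted against their index
plot(x_num, main = "Numeric vector")

# Factor: a bar chart of the level counts is drawn
plot(x_fac, main = "Factor")

# Data frame: a matrix of pairwise scatter plots is drawn
plot(mtcars[, c("mpg", "wt", "hp")], main = "Data frame")
```

The same call, plot(), produces three different chart types because R selects a method based on the class of its first argument.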
Using the plot() Function
Basic Syntax
plot(x, y)
Example
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
plot(x, y)
This produces a scatter plot, showing the relationship between x and y.
Types of Plots Using plot()
Scatter Plot
Used to analyze relationships between two numerical variables.
plot(x, y, type = "p")
Line Plot
Used to show trends over time or ordered data.
plot(x, y, type = "l")
Combined Points and Lines
plot(x, y, type = "b")
Vertical Line Plot
plot(x, y, type = "h")
Graphical Models in R
Introduction to Graphical Models
Graphical models in R are visual representations of statistical data and relationships. They are used to:
Understand data distribution
Visualize correlations
Validate statistical assumptions
Analyze model performance
Graphical models include:
Scatter plots
Histograms
Boxplots
Regression plots
Residual plots
Example: Visualizing a Relationship
plot(mtcars$wt, mtcars$mpg)
This scatter plot shows how fuel efficiency (mpg) varies with car weight (wt): heavier cars tend to have lower mileage.
Charts and Graphs in R
Common Chart Types
Chart Type     Purpose
Line graph     Trends over time
Bar chart      Category comparison
Histogram      Distribution
Scatter plot   Relationship
Boxplot        Spread and outliers
Choosing the correct chart is crucial to avoid misleading interpretation.
Adding Titles to a Graph
Main Title
The main title describes what the graph represents.
plot(x, y, main = "Relationship Between X and Y")
Axis Labels
Axis labels explain what each axis represents.
plot(x, y,
main = "Sales Growth",
xlab = "Months",
ylab = "Revenue")
Clear labels are essential for readability.
Adding Colors to Charts
Importance of Colors
Colors:
Improve readability
Highlight differences
Separate categories
Make graphs visually appealing
Using col Argument
plot(x, y, col = "blue")
Using Multiple Colors
plot(x, y, col = c("red", "green", "blue", "orange", "black"))
Each point gets a different color.
Color in Bar Charts
# scores must be a numeric vector; the values here are hypothetical
scores <- c(72, 85, 90, 65)
barplot(scores, col = "skyblue")
Adding Text to Plots
Using text()
Used to label data points.
plot(x, y)
text(x, y, labels = y, pos = 3)
pos controls the label position (1 = below, 2 = left, 3 = above, 4 = right)
Helps annotate important values
Using mtext()
Adds text in margins.
mtext("Data Source: Survey", side = 1, line = 3)
Adding Axis to a Plot
Default Axes
R automatically generates axes based on data range.
Custom Axes
Disable default axes:
plot(x, y, xaxt = "n", yaxt = "n")
Add custom axes:
axis(1, at = 1:5)
axis(2, at = seq(0, 10, 2))
box()
Custom axes provide better control.
Axis Limits
Set axis limits manually:
plot(x, y, xlim = c(0, 6), ylim = c(0, 12))
Graphics Palette in R
What is a Graphics Palette?
A graphics palette defines the set of colors used when multiple colors are needed automatically.
View Current Palette
palette()
Set a Custom Palette
palette(c("red", "blue", "green", "orange"))
Reset:
palette("default")
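The sketch below shows one way the palette might be used in practice: once a custom palette is set, a numeric col value picks the color at that position. The data points are hypothetical.

```r
# Set a custom palette of four colors
palette(c("red", "blue", "green", "orange"))
n_colors <- length(palette())

# Hypothetical data: four points, one per palette color
x <- 1:4
y <- c(3, 7, 5, 9)

# col = 1:4 selects the 1st to 4th palette colors in order
plot(x, y, col = 1:4, pch = 19, cex = 2,
     main = "Points colored by palette index")

# Restore the default palette so later plots are unaffected
palette("default")
```

Resetting the palette at the end is a defensive habit, since the palette is a global setting for the R session.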
Plotting Data Using Generic Plots
Plotting a Single Vector
v <- c(5, 10, 15, 20)
plot(v)
R plots index vs value.
Plotting Two Vectors
plot(x, y)
Plotting Data Frames
plot(mtcars)
This creates multiple pairwise plots.
Bar Charts in R
Introduction to Bar Charts
A bar chart displays data using rectangular bars. The length of each bar represents the value of a category.
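A small sketch of a bar chart built with barplot() follows. The subject names and scores are hypothetical values chosen only to give the bars something to display.

```r
# Hypothetical scores per subject (a named vector supplies the bar labels)
scores <- c(Math = 85, Science = 92, English = 78, History = 88)

# barplot() returns the x-coordinates of the bar midpoints
bar_mids <- barplot(scores, col = "skyblue",
                    main = "Scores by Subject",
                    xlab = "Subject", ylab = "Score",
                    ylim = c(0, 100))

# Use the midpoints to print each value above its bar
text(bar_mids, scores, labels = scores, pos = 3)
```

Capturing the return value of barplot() is what makes it possible to position annotations exactly over each bar.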
Data visualization in R is a powerful tool for exploring and communicating data. Base R graphics provide flexible and customizable plotting options. Understanding titles, colors, axes, text annotations, palettes, and bar charts ensures clear, accurate, and effective visual communication.
The sqldf package in R enables seamless manipulation of data frames using SQL commands. It provides an efficient way to work with structured data and can be used to interact with a limited range of databases. Instead of using table names as in traditional SQL, sqldf allows you to specify data frame names, making it easy to execute queries within R.
Key Operations of sqldf
When executing an SQL statement on a data frame using sqldf, the following steps occur:
A temporary database is created with an appropriate schema.
The data frames are automatically loaded into this database.
The SQL query is executed.
The resulting output is returned as a new data frame in R.
The temporary database is automatically deleted after execution.
Because the work is delegated to a database engine (SQLite by default), joins, filters, and aggregations on data frames can be written in familiar SQL.
install.packages("sqldf")
library(sqldf)
Loading Sample Data
For demonstration, we use two CSV files:
accidents.csv: Contains Year, Highway, Crash_Count, and Traffic.
routes.csv: Contains Highway, Region, and Distance.
Highway Region Distance
1 Highway-101 North Zone 200
2 Highway-405 South Zone 150
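The CSV files themselves are not included in the text, so the sketch below builds small stand-in data frames instead. The routes rows are the two shown above, and the accidents rows are taken from the sample join output shown later in this section. With the real files available, read.csv("accidents.csv") and read.csv("routes.csv") would presumably be used instead.

```r
# Stand-in for accidents.csv (only the rows visible in the sample output)
accidents <- data.frame(
  Year        = c(2000, 2001, 2002),
  Highway     = rep("Highway-101", 3),
  Crash_Count = c(30, 35, 40),
  Traffic     = c(50000, 52000, 54000),
  stringsAsFactors = FALSE
)

# Stand-in for routes.csv (the two rows shown above)
routes <- data.frame(
  Highway  = c("Highway-101", "Highway-405"),
  Region   = c("North Zone", "South Zone"),
  Distance = c(200, 150),
  stringsAsFactors = FALSE
)

# Inspect the structure of both data frames
str(accidents)
str(routes)
```

sqldf refers to data frames by their variable names, so the names accidents and routes are the ones the SQL queries use.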
SQL Operations with sqldf
1. Performing a Left Join
library(tcltk)
join_query <- "SELECT accidents.*, routes.Region, routes.Distance
FROM accidents
LEFT JOIN routes ON accidents.Highway = routes.Highway"
accidents_routes <- sqldf(join_query, stringsAsFactors = FALSE)
head(accidents_routes)
tail(accidents_routes)
Sample Output:
Year Highway Crash_Count Traffic Region Distance
1 2000 Highway-101 30 50000 North Zone 200
2 2001 Highway-101 35 52000 North Zone 200
3 2002 Highway-101 40 54000 North Zone 200
A database is a structured collection of organized data that allows easy access, storage, and management. It can be handled using a Database Management System (DBMS), which is specialized software for managing databases efficiently. A database contains related and structured data that can be stored and retrieved when needed.
A database primarily supports data storage, retrieval, and manipulation through various sublanguages:
Data Definition Language (DDL)
Data Query Language (DQL)
Data Manipulation Language (DML)
Data Control Language (DCL)
Transaction Control Language (TCL)
Step 1: Install MySQL
To begin, download and install MySQL from its official website:
Once installed, create a new database in MySQL using the following command:
CREATE DATABASE studentDB;
Step 2: Install RStudio
To write and execute R scripts, install RStudio from:
Step 3: Install MySQL Library in R
In RStudio, install the MySQL package with the command:
install.packages("RMySQL")
Now, execute the following R script to connect MySQL with R:
# Load the RMySQL library (this also attaches DBI)
library(RMySQL)
# Establish a connection to the MySQL database
mysql_connection <- dbConnect(MySQL(),
                              user = 'root',
                              password = 'root',
                              dbname = 'studentDB',
                              host = 'localhost')
# List available tables in the database
dbListTables(mysql_connection)
# Create a table; dbExecute() runs a statement that returns no rows
dbExecute(mysql_connection, "CREATE TABLE students (id INT, name VARCHAR(20));")
# Insert records into the table
dbExecute(mysql_connection, "INSERT INTO students VALUES (201, 'Rahul');")
dbExecute(mysql_connection, "INSERT INTO students VALUES (202, 'Neha');")
dbExecute(mysql_connection, "INSERT INTO students VALUES (203, 'Ankit');")
# Retrieve records from the table
query_result <- dbSendQuery(mysql_connection, "SELECT * FROM students")
# Store the result in an R data frame (n = -1 fetches all rows)
data_frame <- dbFetch(query_result, n = -1)
# Release the result set once the rows have been fetched
dbClearResult(query_result)
# Display the data frame
print(data_frame)
# Close the connection when finished
dbDisconnect(mysql_connection)
In R, working with datasets is a crucial aspect of statistical analysis and visualization. Instead of manually creating datasets in the console each time, we can retrieve structured and normalized data directly from relational databases such as MySQL, Oracle, and SQL Server. This integration allows for seamless data manipulation and visualization within R.
This guide focuses on MySQL connectivity in R, covering database connection, table creation, deletion, data insertion, updating, and querying.
RMySQL Package
R provides the RMySQL package to facilitate communication between R and MySQL databases. This package needs to be installed and loaded before connecting to MySQL.
Installation
install.packages("RMySQL")
Establishing Connection to MySQL
To connect to MySQL, the dbConnect() function is used, which requires a database driver along with authentication credentials such as username, password, database name, and host details.
Inserting Data into a MySQL Table Using R
Data can be inserted into a MySQL table from R using SQL INSERT INTO queries.
Example: Inserting Data
# Establish connection
conn <- dbConnect(MySQL(), user = 'admin', password = 'mypassword',
dbname = 'SampleDB', host = 'localhost')
# Insert new record into employees table
dbSendQuery(conn, "INSERT INTO employees(id, name) VALUES (1, 'John Doe')")
Output:
<MySQLResult:9845732, 3, 5>
Updating Data in a MySQL Table Using R
An existing record in the table can be modified using the UPDATE query.
Example: Updating a Table
# Establish connection
conn <- dbConnect(MySQL(), user = 'admin', password = 'mypassword',
dbname = 'SampleDB', host = 'localhost')
# Update a record in employees table
dbSendQuery(conn, "UPDATE employees SET name = 'Jane Doe' WHERE id = 1")
Output:
<MySQLResult:-1, 3, 6>
Retrieving Data from MySQL Using R
To fetch data from MySQL, the dbSendQuery() function is used to send a SQL SELECT statement. The retrieved data can be stored in a dataframe using the fetch() function.
Example:
# Establish connection
conn <- dbConnect(MySQL(), user = 'admin', password = 'mypassword',
dbname = 'SampleDB', host = 'localhost')
# Fetch records from employees table
res <- dbSendQuery(conn, "SELECT * FROM employees")
# Retrieve first 3 rows as dataframe
df <- fetch(res, n = 3)
print(df)
In data analysis, it is often necessary to read and process data stored outside the R environment. Importing data into R is a crucial step in such cases. R supports multiple file formats, including CSV, JSON, Excel, Text, and XML. Most data is available in tabular format, and R provides functions to read this structured data into a data frame. Data frames are widely used in R because they facilitate data extraction from rows and columns, making statistical computations easier than with other data structures.
Common Functions for Importing Data into R
The most frequently used functions for reading tabular data into R are:
read.table()
read.csv()
fromJSON()
read.xlsx()
Reading Data from a Text File
The read.table() function is used to read tabular data from a text file.
Parameters:
file: Specifies the file name.
header: A logical flag indicating if the first line contains column names.
nrows: Specifies the number of rows to read.
skip: Skips a specified number of lines from the beginning.
colClasses: A character vector indicating the class of each column.
sep: A string that defines column separators (e.g., commas, spaces, tabs).
For small or moderately sized datasets, read.table() can be called with only the file argument. R then works out the number of rows and columns and the class of each column on its own, and it skips lines that start with # (comments). Specifying arguments such as nrows and colClasses makes reading faster and less memory-hungry, especially for large datasets.
Example:
Assume a text file data.txt in the current directory contains the following data:
Name Age Salary
John 28 50000
Emma 25 60000
Alex 30 70000
Reading the file in R:
read.table("data.txt", header=TRUE)
Output:
Name Age Salary
1 John 28 50000
2 Emma 25 60000
3 Alex 30 70000
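To illustrate the parameters listed earlier, the sketch below reads the same data with colClasses and nrows supplied explicitly. It writes the sample rows to a temporary file first so that it can run on its own; the variable names txt_path and people are illustrative.

```r
# Write the sample data to a temporary file so the example is self-contained
txt_path <- file.path(tempdir(), "data.txt")
writeLines(c("Name Age Salary",
             "John 28 50000",
             "Emma 25 60000",
             "Alex 30 70000"), txt_path)

# Read the file, stating the column classes and row count up front
people <- read.table(txt_path,
                     header = TRUE,
                     sep = "",                 # any whitespace separates columns
                     colClasses = c("character", "integer", "numeric"),
                     nrows = 3)

# Confirm the classes that were applied
str(people)
```

Supplying colClasses spares R from guessing each column's type, which is where much of the time goes on large files.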
Reading Data from a CSV File
The read.csv() function is used for reading CSV files, which are commonly generated by spreadsheet applications like Microsoft Excel. It is similar to read.table() but uses a comma as the default separator and assumes header=TRUE by default.
Example:
Assume a CSV file data.csv holds the records below. Reading it with read.csv("data.csv") returns the following data frame:
Name Age Salary
1 John 28 50000
2 Emma 25 60000
3 Alex 30 70000
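A self-contained sketch of this step is shown below. It first writes the comma-separated text to a temporary file, on the assumption that data.csv has this layout, and then reads it back; csv_path and employees are illustrative names.

```r
# Write comma-separated sample data to a temporary file
csv_path <- file.path(tempdir(), "data.csv")
writeLines(c("Name,Age,Salary",
             "John,28,50000",
             "Emma,25,60000",
             "Alex,30,70000"), csv_path)

# read.csv() uses sep = "," and header = TRUE by default
employees <- read.csv(csv_path)

# Display the data frame and its column types
print(employees)
str(employees)
```

Because header = TRUE is the default, the first line of the file becomes the column names without any extra argument.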
Memory Considerations
For large files, it is essential to estimate the memory required before loading data. The approximate memory needed for a dataset with 2,000,000 rows and 200 numeric columns can be calculated as:
2,000,000 rows x 200 columns x 8 bytes = 3,200,000,000 bytes = 3.2 GB
Since R requires additional memory for processing, at least twice this amount (6.4 GB) should be available.
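The same estimate can be reproduced in R. The sketch below assumes every column is stored as a double (8 bytes per value) and uses decimal gigabytes (1 GB = 10^9 bytes), matching the figure above.

```r
# Dimensions of the dataset and storage size of one numeric value
n_rows <- 2e6
n_cols <- 200
bytes_per_value <- 8

# Raw storage needed for the data itself
bytes_needed <- n_rows * n_cols * bytes_per_value
gb_needed <- bytes_needed / 1e9

# Rule of thumb from the text: allow about twice that for processing
gb_recommended <- 2 * gb_needed

cat("Data size:       ", gb_needed, "GB\n")
cat("Recommended RAM: ", gb_recommended, "GB\n")

# Check the 8-bytes-per-value assumption on a small numeric vector
print(object.size(numeric(1e6)), units = "MB")
```

The object.size() check on a one-million-element vector is a quick way to confirm the per-value cost before scaling the estimate up.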
Reading Data from a JSON File
The fromJSON() function from the rjson package is used to import JSON data into R.
JSON (JavaScript Object Notation) is a widely used data format that stores information in a structured and readable manner, using text-based key-value pairs. Just like other files, JSON files can be both read and written in R. To work with JSON files in R, we need to install and use the rjson package.
Common JSON Operations in R
Using the rjson package, we can perform various tasks, including:
Installing and loading the rjson package
Creating a JSON file
Reading data from a JSON file
Writing data into a JSON file
Converting JSON data into a dataframe
Extracting data from URLs
Installing and Loading the rjson Package
To use JSON functionality in R, install the rjson package using the command below:
install.packages("rjson")
Once installed, load the package into the R environment using:
library("rjson")
To create a JSON file, follow these steps:
Open a text editor (such as Notepad) and enter data in the JSON format.
Save the file with a .json extension (e.g., sample.json).
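The contents of sample.json are not shown in the text. The sketch below writes one plausible version from R itself, assuming a column-oriented layout whose fields and values match the data frame output shown later in this section. The layout and the json_path location are assumptions.

```r
# Assumed contents of sample.json: one array per field
json_lines <- c(
  '{',
  '  "EmployeeID": [101, 102, 103, 104, 105],',
  '  "Name": ["Amit", "Rohit", "Sneha", "Priya", "Karan"],',
  '  "Salary": [55000, 63000, 72000, 80000, 59000],',
  '  "JoiningDate": ["2015-03-25", "2018-07-10", "2020-01-15", "2017-09-12", "2019-05-30"],',
  '  "Department": ["IT", "HR", "Finance", "Operations", "Marketing"]',
  '}'
)

# Save the text with a .json extension (a temporary folder is used here)
json_path <- file.path(tempdir(), "sample.json")
writeLines(json_lines, json_path)

# Show what was written
cat(readLines(json_path), sep = "\n")
```

Keeping one equal-length array per field means the parsed list converts cleanly to a data frame with one column per field.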
The fromJSON() function helps read and parse JSON data from a file. The extracted data is stored as a list by default.
Example Code:
# Load required package
library("rjson")
# Read the JSON file from a specified location
data <- fromJSON(file = "D:\\sample.json")
# Print the data
print(data)
To write data into a JSON file, we first convert data into a JSON object using the toJSON() function and then use the write() function to store it in a file.
Example Code:
# Load the required package
library("rjson")
# Creating a list with sample data
data_list <- list(
Fruits = c("Apple", "Banana", "Mango"),
Category = c("Fruit", "Fruit", "Fruit")
)
# Convert list to JSON format
json_output <- toJSON(data_list)
# Write JSON data to a file
write(json_output, "output.json")
# Read and print the created JSON file
result <- fromJSON(file = "output.json")
print(result)
In R, JSON data can be transformed into a dataframe using as.data.frame(), allowing easy manipulation and analysis.
Example Code:
# Load required package
library("rjson")
# Read JSON file
data <- fromJSON(file = "D:\\sample.json")
# Convert JSON data to a dataframe
json_df <- as.data.frame(data)
# Print the dataframe
print(json_df)
Output:
EmployeeID Name Salary JoiningDate Department
1 101 Amit 55000 2015-03-25 IT
2 102 Rohit 63000 2018-07-10 HR
3 103 Sneha 72000 2020-01-15 Finance
4 104 Priya 80000 2017-09-12 Operations
5 105 Karan 59000 2019-05-30 Marketing
Working with JSON Data from a URL
JSON data can be extracted from online sources using either the jsonlite or RJSONIO package.
Example Code:
# Load the required package
library(RJSONIO)
# Fetch JSON data from a URL
data_url <- fromJSON("https://api.publicapis.org/entries")
# Extract specific fields
API_Names <- sapply(data_url$entries, function(x) x$API)
# Display first few API names
head(API_Names)