Reading Tabular Data in detail
In data analysis, it is often necessary to read and process data stored outside the R environment. Importing data into R is a crucial step in such cases. R supports multiple file formats, including CSV, JSON, Excel, Text, and XML. Most data is available in tabular format, and R provides functions to read this structured data into a data frame. Data frames are widely used in R because they facilitate data extraction from rows and columns, making statistical computations easier than with other data structures.
Common Functions for Importing Data into R
The most frequently used functions for reading tabular data into R are:
read.table()
read.csv()
fromJSON()
read.xlsx()
Reading Data from a Text File
The read.table() function is used to read tabular data from a text file.
Parameters:
file: Specifies the file name.
header: A logical flag indicating if the first line contains column names.
nrows: Specifies the number of rows to read.
skip: Skips a specified number of lines from the beginning.
colClasses: A character vector indicating the class of each column.
sep: A string that defines column separators (e.g., commas, spaces, tabs).
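For small or moderately sized datasets, read.table() can usually be called with just the file name (plus header=TRUE when the first line holds column names). R then works out the number of rows and columns and the class of each column on its own, and skips lines that start with # (comments). For large datasets, supplying arguments such as colClasses and nrows explicitly makes reading faster and reduces memory use, because R no longer has to infer them.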
Example:
Assume a text file data.txt in the current directory contains the following data:
Name Age Salary
John 28 50000
Emma 25 60000
Alex 30 70000
Reading the file in R:
read.table("data.txt", header=TRUE)
Output:
Name Age Salary
1 John 28 50000
2 Emma 25 60000
3 Alex 30 70000
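To illustrate the point about supplying arguments explicitly, here is a minimal sketch. It assumes the same three-row data as above, written to a temporary file so the snippet runs on its own; the leading comment line and the variable names (path, employees) are made up for illustration.

```r
# Write the sample data to a temporary file so the sketch is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(c("# employee records (comment line, skipped by read.table)",
             "Name Age Salary",
             "John 28 50000",
             "Emma 25 60000",
             "Alex 30 70000"), path)

# Tell read.table() what to expect instead of letting it guess.
employees <- read.table(
  path,
  header = TRUE,                                      # first data line holds column names
  sep = "",                                           # any run of whitespace separates fields
  nrows = 3,                                          # number of data rows to read
  colClasses = c("character", "integer", "numeric"),  # one class per column
  comment.char = "#"                                  # the default, shown for clarity
)

str(employees)
```

On a file this small the explicit arguments make no measurable difference; they start to pay off when the file has many rows or columns.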
Reading Data from a CSV File
The read.csv() function is used for reading CSV files, which are commonly generated by spreadsheet applications like Microsoft Excel. It is similar to read.table() but uses a comma as the default separator and assumes header=TRUE by default.
Example:
Assume a CSV file data.csv contains the following:
Name,Age,Salary
John,28,50000
Emma,25,60000
Alex,30,70000
Reading the file in R:
read.csv("data.csv")
Output:
Name Age Salary
1 John 28 50000
2 Emma 25 60000
3 Alex 30 70000
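The relationship between read.csv() and read.table() can be checked directly. The sketch below assumes the same sample data, written to a temporary file, and spells out the defaults that read.csv() applies on top of read.table(); the variable names are illustrative only.

```r
# Write the sample CSV to a temporary file so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age,Salary",
             "John,28,50000",
             "Emma,25,60000",
             "Alex,30,70000"), path)

# read.csv() with its defaults ...
via_csv <- read.csv(path)

# ... and read.table() with those defaults written out by hand.
via_table <- read.table(path, header = TRUE, sep = ",", quote = "\"",
                        dec = ".", fill = TRUE, comment.char = "")

# TRUE if both calls produce the same data frame.
identical(via_csv, via_table)
```

In practice read.csv() is the more convenient choice for comma-separated files; the longer read.table() call is shown only to make the defaults visible.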
Memory Considerations
For large files, it is essential to estimate the memory required before loading the data. R stores each numeric value as an 8-byte double, so the approximate memory needed for a dataset with 2,000,000 rows and 200 numeric columns is:
2,000,000 × 200 × 8 bytes = 3,200,000,000 bytes ≈ 3.2 GB
Since R requires additional memory while reading and processing the data, at least twice this amount (6.4 GB) should be available.
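The same arithmetic can be done in R before loading a file. This is a rough sketch: estimate_memory_gb() is a hypothetical helper written for this example, not a base R function, and it assumes every column is numeric and that 1 GB is 10^9 bytes.

```r
# Rough memory estimate for a purely numeric dataset.
# estimate_memory_gb() is a hypothetical helper, not part of base R.
estimate_memory_gb <- function(rows, cols, bytes_per_value = 8) {
  rows * cols * bytes_per_value / 1e9  # 1 GB taken as 10^9 bytes
}

raw_gb <- estimate_memory_gb(2e6, 200)  # size of the data itself
recommended_gb <- 2 * raw_gb            # rule of thumb: keep twice that free

cat("Raw data:", raw_gb, "GB; suggested free memory:", recommended_gb, "GB\n")
```

Columns of other types (character, factor) have different per-value costs, so the result is a starting estimate and not an exact figure.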
Reading Data from a JSON File
The fromJSON() function from the rjson package is used to import JSON data into R.
Installation:
install.packages("rjson")
Example:
Assume a JSON file data.json contains:
{
"Name": ["John", "Emma", "Alex"],
"Age": [28, 25, 30],
"Salary": [50000, 60000, 70000]
}
Reading the JSON file in R:
library(rjson)
data <- fromJSON(file="data.json")
as.data.frame(data)
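To see what the conversion produces without needing a file on disk, the same JSON can be passed as a string through the json_str argument of fromJSON(). This is a minimal sketch that assumes the rjson package is installed (the code is skipped otherwise); json_text and employees_json are illustrative names.

```r
# Run only if rjson is available, so the sketch does not fail on a fresh install.
if (requireNamespace("rjson", quietly = TRUE)) {
  json_text <- '{
    "Name": ["John", "Emma", "Alex"],
    "Age": [28, 25, 30],
    "Salary": [50000, 60000, 70000]
  }'

  # Each JSON array becomes an R vector inside a named list.
  parsed <- rjson::fromJSON(json_str = json_text)

  # Equal-length vectors convert to a data frame with one column per key.
  employees_json <- as.data.frame(parsed)
  str(employees_json)
}
```

The conversion with as.data.frame() works here because every key maps to an array of the same length; nested or ragged JSON needs extra reshaping first.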
Reading Excel Sheets
The read.xlsx() function is used to import Excel worksheets into R. It is provided by the xlsx package, which in turn depends on a working Java installation (through the rJava package).
Installation:
install.packages("xlsx")
Example:
Assume an Excel file data.xlsx with the following content:
| Name | Age | Salary |
|---|---|---|
| John | 28 | 50000 |
| Emma | 25 | 60000 |
| Alex | 30 | 70000 |
Reading the first sheet:
library("xlsx")
read.xlsx("data.xlsx", 1)
Output:
Name Age Salary
1 John 28 50000
2 Emma 25 60000
3 Alex 30 70000
For large worksheets (more than about 100,000 cells), read.xlsx2() is usually the better choice: it does more of the work in Java through the readColumns() function, which is optimized for tabular data, and is therefore considerably faster.
By using these functions, data can be efficiently imported into R for further processing and analysis.