# Data Manipulation with R - Summary Notes


Jianghao Wang

wangjh@lreis.ac.cn

Dec. 11, 2012: version 0.1

Reference

Spector, P., Data manipulation with R. Use R! 2008, New York: Springer. ix, 152 p.

Chapter 1 Data in R

Modes and classes

The mode function returns the mode of any object in R, and the class function returns the class.

Common modes include numeric, character, logical, list, and function; matrices, data frames, factors, dates, and times are distinguished by their class rather than their mode.

typeof function

Dates and times: the Date, POSIXlt, and POSIXct classes, and the contributed chron package

sapply function

mylist = list(a=c(1,2,3),b=c("cat","dog","duck"), d=factor(c("a","b","a")))

sapply(mylist,mode)

sapply(mylist,class)
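A minimal sketch of how mode, class, and typeof can disagree for the same object. The list below is illustrative only; note that a factor has mode "numeric" but class "factor".

```r
# Illustrative list; the factor is built from a single character vector.
mylist <- list(a = c(1, 2, 3),
               b = c("cat", "dog", "duck"),
               d = factor(c("a", "b", "a")))

modes   <- sapply(mylist, mode)    # storage mode of each element
classes <- sapply(mylist, class)   # class of each element
types   <- sapply(mylist, typeof)  # internal type of each element

# Show the three views side by side, one column per list element
print(rbind(mode = modes, class = classes, typeof = types))
```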

Data storage in R

The c function (a mnemonic for catenate or combine) builds vectors, e.g. x = c(1,2,5,10); mode(x)

The names assignment function labels the elements, e.g. names(x) = c('one','two','five','ten')

Matrices and arrays: the matrix function takes nrow =, ncol =, byrow = TRUE, and dimnames =; dim, dimnames, and row.names get or set these attributes

rmat = matrix(rnorm(15),5,3, dimnames=list(NULL,c('A','B','C')))

list, e.g. mylist = list(first=c(1,3,5),second=c('one','three','five'), third='end')

data.frame

Testing for modes and classes

is.list, is.factor, is.numeric, is.data.frame, and is.character

Structure of R objects

summary(mylist)

str(mylist)

Conversion of Objects

as.* functions (as.numeric, as.character, as.list, and so on)

Missing values

NA represents a missing value

The is.na function tests for missing values

Working with missing Values

Many functions accept na.rm = TRUE to drop missing values before computing

x[!is.na(x)]

na.action = argument (for example na.omit) for modeling functions

na.strings = argument of read.table
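A short sketch of the missing-value tools listed above, using a made-up vector. It shows that a summary function returns NA unless na.rm = TRUE is set, and that logical subscripting can drop the missing entries.

```r
# Made-up vector with one missing value
x <- c(3, NA, 7, 10)

missing    <- is.na(x)               # TRUE where a value is missing
mean_raw   <- mean(x)                # NA, because one value is missing
mean_clean <- mean(x, na.rm = TRUE)  # mean of the three observed values
x_complete <- x[!is.na(x)]           # keep only the observed values

print(missing)
print(mean_raw)
print(mean_clean)
print(x_complete)
```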

Chapter 2 Reading and Writing Data

Reading Vectors and Matrices

For small data sets, the scan function reads values into a vector

To read a matrix with scan: mymat = matrix(scan(),ncol=3,byrow=TRUE)

Data frames: read.table

read.table(file, header = FALSE, sep = "", quote = "\"'",

dec = ".", row.names, col.names,

as.is = !stringsAsFactors,

na.strings = "NA", colClasses = NA, nrows = -1,

skip = 0, check.names = TRUE, fill = !blank.lines.skip,

strip.white = FALSE, blank.lines.skip = TRUE,

comment.char = "#",

allowEscapes = FALSE, flush = FALSE,

stringsAsFactors = default.stringsAsFactors(),

fileEncoding = "", encoding = "unknown", text)

Comma- and Tab-Delimited Input Files

read.csv, read.csv2, and read.delim
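As a sketch, read.csv can be tried without a file on disk by passing the data through the text argument. The CSV content and the use of "." as the missing-value marker are assumptions made for illustration.

```r
# Hypothetical CSV content in which "." marks a missing score
csv_text <- "name,score\nann,10\nbob,.\ncat,7"

# na.strings tells the reader which strings to convert to NA
scores <- read.csv(text = csv_text, na.strings = ".",
                   stringsAsFactors = FALSE)

print(scores)
print(mean(scores$score, na.rm = TRUE))
```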

Fixed-width input files

read.fwf function

city = read.fwf("city.txt",widths=c(18,-19,8),as.is=TRUE)
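A small sketch of how the widths argument works: positive widths are read as fields and negative widths are skipped. The two input lines and the temporary file are invented for the example.

```r
# Invented fixed-width lines: 6 characters of name, 2 to skip, 4 of number
lines <- c("Boston--0650", "Denver--0716")
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

# The negative width (-2) skips the two filler characters
city <- read.fwf(path, widths = c(6, -2, 4), as.is = TRUE)
unlink(path)  # remove the temporary file

print(city)
```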

Extracting data from R objects

apropos function can be used to find all the available methods for a given class

Find Objects by (Partial) Name

apropos(".*\\.lm$")

## [1] "anova.lm" "anovalist.lm" "hatvalues.lm"

## [4] "kappa.lm" "model.frame.lm" "model.matrix.lm"

## [7] "plot.lm" "predict.lm" "print.lm"

## [10] "residuals.lm" "rstandard.lm" "rstudent.lm"

## [13] "summary.lm"

showMethods function

Laying out tables (with the xtable package)

library(xtable)

setwd("C:/Users/Jinghao/Desktop/dataManipulation/")

data <- read.csv(file = "table/table1.csv", header = T)

print(xtable(data), type = "html")

| Function | Data source |
|---|---|
| file | Files on the local file system |
| pipe | Output from a command |
| textConnection | Treats text as a file |
| gzfile | Local gzipped file |
| unz | Local zip archive (with single file; read-only) |
| bzfile | Local bzipped file |
| url | Remote file read via http |
| socketConnection | Socket for client/server programs |

Reading Large data files

Define a function to read large data sets

Generating data

Sequences

1:10 or seq(1, 10, 1)

gl function (“generate levels”)

thelevels = data.frame(group=gl(3,10,length=30),

subgroup=gl(5,2,length=30),

obs=gl(2,1,length=30))

expand.grid function, e.g. oe = expand.grid(odd=seq(1,5,by=2),even=seq(2,5,by=2))
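The two generators above can be checked directly. This sketch rebuilds both objects and inspects their shape: gl repeats each level in runs, and expand.grid returns every combination of its arguments.

```r
# gl(n, k, length): n levels, each repeated k times, recycled to `length`
thelevels <- data.frame(group    = gl(3, 10, length = 30),
                        subgroup = gl(5, 2, length = 30),
                        obs      = gl(2, 1, length = 30))
counts <- table(thelevels$group)  # each of the 3 groups appears 10 times

# expand.grid: all combinations of odd (1, 3, 5) and even (2, 4)
oe <- expand.grid(odd = seq(1, 5, by = 2), even = seq(2, 5, by = 2))

print(counts)
print(oe)
```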

Random Numbers

for example: rnorm; runif

set.seed function

Permutations

Random Permutations

sample function

Enumerating all permutations

Working with sequences

table function, can tabulate the number of occurrences of each value in a sequence

unique function

duplicated function
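A brief sketch of the functions named in this part, on made-up values: sample draws a random permutation (made reproducible with set.seed), while table, unique, and duplicated summarize repeated values.

```r
set.seed(42)          # fix the random number stream for reproducibility
perm <- sample(1:5)   # a random permutation of 1..5

x <- c(1, 2, 2, 3, 3, 3)
counts <- table(x)       # number of occurrences of each value
uniq   <- unique(x)      # distinct values, in order of first appearance
dups   <- duplicated(x)  # TRUE for every repeat of an earlier value

print(perm)
print(counts)
print(uniq)
print(dups)
```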

Spreadsheets

The RODBC package on Windows

odbcConnectExcel function from the RODBC package

library(RODBC)

sheet = 'c:\\Documents and Settings\\user\\My Documents\\sheet.xls'

con = odbcConnectExcel(sheet)

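Note the doubled backslashes in the file name: a backslash must be escaped inside an R string

The sqlTables command lists the available tables, e.g. tbls = sqlTables(con)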

qry = paste("SELECT * FROM",tbls$TABLE_NAME[1],sep=' ')

result = sqlQuery(con,qry)

The gdata Package (All Platforms)

An alternative to using the RODBC package is the read.xls function of the gdata package, but it requires Perl to be installed on your computer.

Saving and Loading R Data Objects

save(list=c('x','y','z'),file='mydata.rda')

load('mydata.rda')
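A round-trip sketch of save and load, written against a temporary file rather than 'mydata.rda' so that it leaves nothing behind. The objects x and y are placeholders.

```r
# Placeholder objects to store
x <- 1:3
y <- letters[1:2]

path <- tempfile(fileext = ".rda")
save(list = c("x", "y"), file = path)  # write both objects to one file

rm(x, y)                 # remove them from the workspace
restored <- load(path)   # load recreates them and returns their names
unlink(path)             # remove the temporary file

print(restored)
print(x)
print(y)
```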

Working with Binary Files

The readBin and writeBin functions provide a flexible way to read and write binary files.

Writing R Objects to Files in ASCII Format

The write Function

write(t(state.x77),file='state.txt',ncolumns=ncol(state.x77))

The write.table or write.csv function

write.table(CO2,file='co2.txt',row.names=FALSE,sep=',')
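To check that a comma-separated file written this way can be read back, a sketch along these lines may help. It uses the built-in CO2 data set and a temporary file instead of 'co2.txt'.

```r
path <- tempfile(fileext = ".txt")

# Write the built-in CO2 data set as comma-separated text
write.table(CO2, file = path, row.names = FALSE, sep = ",")

# Read it back; read.csv expects a header line and comma separators
co2_back <- read.csv(path)
unlink(path)  # remove the temporary file

print(dim(co2_back))
print(names(co2_back))
```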

Reading Data from Other Programs

foreign package

write.foreign

write.foreign(mydata,'mydata.txt','mydata.stata', package='Stata')

Chapter 3 R and Databases

ODBC (Open DataBase Connectivity)

DBI package of R along with a specialized package for the particular database

3.1 A Brief Guide to SQL

SELECT * FROM tablename WHERE var1 > 10 AND var2 < var1

Aggregation

SELECT type,AVG(x) AS mean FROM tablename GROUP BY type
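For readers more at home in R than SQL, the two queries above correspond roughly to subset and aggregate on a data frame. This is only an analogy: the data frame and its values are hypothetical and no database is involved.

```r
# Hypothetical data frame standing in for the SQL table
tablename <- data.frame(type = c("a", "a", "b", "b"),
                        var1 = c(5, 12, 20, 30),
                        var2 = c(1, 15, 3, 40),
                        x    = c(1, 3, 10, 20))

# Roughly: SELECT * FROM tablename WHERE var1 > 10 AND var2 < var1
selected <- subset(tablename, var1 > 10 & var2 < var1)

# Roughly: SELECT type, AVG(x) AS mean FROM tablename GROUP BY type
means <- aggregate(x ~ type, data = tablename, FUN = mean)
names(means)[2] <- "mean"  # mimic the AS mean alias

print(selected)
print(means)
```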

3.2 ODBC

3.3 Using the RODBC Package

library(RODBC)

con = odbcConnect('myodbc')

con = odbcConnect('myodbc;password=xxxxx')

3.4 The DBI Package

One of the most popular databases used with R is **MySQL**

3.5 Accessing a MySQL Database

library(RMySQL)

drv = dbDriver("MySQL")

con = dbConnect(drv,dbname='test',user='sqluser',

password='secret',host='sql.company.com')

3.6 Performing Queries

`mydata = dbGetQuery(con,'select * from mydata')`

3.7 Normalized Tables

3.8 Getting Data into MySQL

CREATE TABLE mydata (name text, number double)

3.9 More Complex Aggregations

see documents for more detail

Chapter 4 Dates

The as.Date function handles dates (without times);

the contributed package chron handles dates and times, but does not control for time zones;

the POSIXct and POSIXlt classes allow for dates and times with control for time zones.

4.1 as.Date

as.Date("1/15/2001", format = "%m/%d/%Y")

## [1] "2001-01-15"

as.Date("April 26, 2001", format = "%B %d, %Y")

## [1] "2001-04-26"

as.Date("22JUN01", format = "%d%b%y")

## [1] "2001-06-22"
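The same format codes work in the other direction, when a Date is turned back into text with format. The sketch below uses only numeric codes so that it does not depend on the locale's month names.

```r
# Parse a US-style date string into a Date object
d <- as.Date("1/15/2001", format = "%m/%d/%Y")

iso <- format(d)              # default ISO output: year-month-day
eu  <- format(d, "%d/%m/%Y")  # day/month/year

# Internally a Date is the number of days since 1970-01-01
days_since_epoch <- as.numeric(as.Date("1970-01-10"))

print(iso)
print(eu)
print(days_since_epoch)
```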

4.2 The chron Package

The chron function converts dates and times to chron objects

Format codes for dates:

m Month (decimal number)

d Day of the month (decimal number)

y Year (4 digit)

mon Month (abbreviated)

month Month (full name)

Format codes for times:

h Hour

m Minute

s Second

4.3 POSIX Classes

The POSIXct class stores date/time values as the number of seconds since January 1, 1970, while the POSIXlt class stores them as a list with elements for second, minute, hour, day, month, and year, among others.
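A small sketch of the difference between the two classes. The timestamp is arbitrary and the time zone is fixed to UTC so the result does not depend on the local machine.

```r
# Arbitrary timestamp, pinned to UTC
ct <- as.POSIXct("2005-10-16 07:51:00", tz = "UTC")
lt <- as.POSIXlt(ct)

secs <- as.numeric(ct)  # POSIXct: seconds since 1970-01-01 00:00:00 UTC

# POSIXlt: list-like components; year counts from 1900, month from 0
parts <- c(year = lt$year + 1900, month = lt$mon + 1,
           hour = lt$hour, minute = lt$min)

print(secs)
print(parts)
```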

Example: mydate = strptime('16/Oct/2005:07:51:00', format='%d/%b/%Y:%H:%M:%S')

Formatting dates

thedate = ISOdate(2005,10,21,18,47,22,tz="PDT")

format(thedate,'%A, %B %d, %Y %H:%M:%S')

as.POSIXlt and as.POSIXct can also accept Date or chron objects

4.4 Working with Dates

Many of the statistical summary functions, such as mean, min, and max, can transparently handle date objects.

mean(rdates$Date)

range(rdates$Date)

rdates$Date[11] - rdates$Date[1]

4.5 Time Intervals

b1 = ISOdate(1977,7,13)

b2 = ISOdate(2003,8,14)

b2 - b1

## Time difference of 9528 days

The difftime function lets you choose the units, e.g. difftime(b2,b1,units='weeks')
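As a sketch, the same interval can be expressed in different units and converted to a plain number for further arithmetic.

```r
b1 <- ISOdate(1977, 7, 13)
b2 <- ISOdate(2003, 8, 14)

gap   <- b2 - b1                            # difftime object, in days here
weeks <- difftime(b2, b1, units = "weeks")  # same interval, in weeks

# Convert to plain numbers in explicitly chosen units
days_num  <- as.numeric(gap, units = "days")
weeks_num <- as.numeric(weeks)

print(gap)
print(weeks)
```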

4.6 Time Sequences

seq(as.Date("1976-7-4"), by = "days", length = 10)

## [1] "1976-07-04" "1976-07-05" "1976-07-06" "1976-07-07" "1976-07-08"

## [6] "1976-07-09" "1976-07-10" "1976-07-11" "1976-07-12" "1976-07-13"

seq(as.Date("2000-6-1"), to = as.Date("2000-8-1"), by = "2 weeks")

## [1] "2000-06-01" "2000-06-15" "2000-06-29" "2000-07-13" "2000-07-27"

Chapter 5 Factors

Factors store variables that take a limited number of distinct values; such variables are often referred to as categorical variables

5.1 Using Factors

args(factor)

## function (x = character(), levels, labels = levels, exclude = NA,

## ordered = is.ordered(x))

## NULL

5.3 Manipulating Factors

5.4 Creating Factors from Continuous Variables

The cut function is used to convert a numeric variable into a factor.

wfact = cut(women$weight,3)

wfact = cut(women$weight,pretty(women$weight,3))

wfact = cut(women$weight,3,labels=c('Low','Medium','High'))

table(wfact)
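To see what cut does with a break count, this sketch uses the built-in women data set, splits the weights into three equal-width intervals, and then tabulates the labelled factor.

```r
# Three equal-width intervals over the range of women$weight
wfact_default <- cut(women$weight, 3)

# The same intervals with readable labels
wfact <- cut(women$weight, 3, labels = c("Low", "Medium", "High"))

print(levels(wfact_default))  # interval notation, e.g. "(a,b]"
print(table(wfact))           # observations per labelled interval
```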

5.5 Factors Based on Dates and Times

months = factor(cmonth,levels=unique(cmonth),ordered=TRUE)

wks = cut(everyday,breaks='week')

5.6 Interactions

args(interaction)

## function (..., drop = FALSE, sep = ".", lex.order = FALSE)

## NULL

Chapter 6 Subscripting

6.1 Basics of Subscripting

For objects that contain more than one element (vectors, matrices, arrays, data frames, and lists), subscripting is used to access some or all of those elements.

6.2 Numeric Subscripts

6.3 Character Subscripts

6.4 Logical Subscripts

The which function returns the indices of the TRUE elements of a logical vector

6.5 Subscripting Matrices and Arrays

The order function returns a vector of indices that will permute its input argument into sorted order.
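A compact sketch of logical subscripts, which, and order on a made-up vector. The same index vector from order can be used to sort the rows of a matrix or data frame.

```r
x <- c(10, 3, 8, 1)

big    <- x[x > 5]     # logical subscript: keep elements greater than 5
idx    <- which(x > 5) # positions of those elements
ord    <- order(x)     # indices that put x into ascending order
sorted <- x[ord]       # x rearranged by those indices

print(big)
print(idx)
print(ord)
print(sorted)
```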

6.6 Specialized Functions for Matrices

row, col, lower.tri, upper.tri, and diag

6.7 Lists

6.8 Subscripting Data Frames

args(subset)

## function (x, ...)

## NULL

Chapter 7 Character Manipulation

7.1 Basics of Character Data

The length function reports the number of character values in a vector

To find the number of characters in a character value, the nchar function can be used

7.2 Displaying and Concatenating Character Strings

cat outputs its arguments, concatenating their representations; it performs much less conversion than print.

The cat function will combine character values and print them to the screen or a file directly.

The paste function combines character values into new character values instead of printing them.

7.3 Working with Parts of Character Values

substring function can be used either to extract parts of character strings, or to change the values of parts of character strings.

7.4 Regular Expressions in R

Regular expressions are supported in the R functions strsplit, grep, sub, and gsub, as well as in the regexpr and gregexpr functions which are the main tools for working with regular expressions in R.

7.5 Basics of Regular Expressions

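The following characters have special meaning in regular expressions and must be escaped with a backslash to be matched literally: . ^ $ + ? * ( ) [ ] { } | \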

7.6 Breaking Apart Character Values

The strsplit function can use a character string or regular expression to divide up a character string into smaller pieces.

7.7 Using Regular Expressions in R

The grep function accepts a regular expression and a character string or vector of character strings, and returns the indices of those elements of the strings which are matched by the regular expression.

7.8 Substitutions and Tagging

For substituting text based on regular expressions, R provides two functions: sub and gsub
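The character functions covered in this chapter can be tried together on a few made-up strings. This is a sketch for orientation rather than a full tour of regular expressions.

```r
s <- c("apple pie", "banana split", "cherry tart")

n_chars <- nchar(s)                   # characters in each string
first3  <- substring(s, 1, 3)         # first three characters
words   <- strsplit(s[1], " ")[[1]]   # split the first string on spaces
hits    <- grep("an", s)              # indices of strings containing "an"
one     <- sub("a", "A", s[2])        # replace the first match only
every   <- gsub("a", "A", s[2])       # replace every match

print(n_chars)
print(first3)
print(words)
print(hits)
print(one)
print(every)
```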

Chapter 8 Data Aggregation

R's aggregation functions fall into two broad groups:

those that are designed to work effectively with arrays and/or lists, like apply, sweep, mapply, sapply, and lapply, and

those that are oriented toward data frames (like aggregate and by).

8.1 table

pets = c("dog", "cat", "duck", "chicken", "duck", "cat", "dog")

tt = table(pets)

tt

## pets

## cat chicken dog duck

## 2 1 2 2

as.data.frame(tt)

## pets Freq

## 1 cat 2

## 2 chicken 1

## 3 dog 2

## 4 duck 2

8.2 Road Map for Aggregation

Groups defined as list elements: use sapply or lapply

lapply always returns a list, while sapply may simplify its output into a vector or array if appropriate

Groups defined by rows or columns of a matrix: use apply

Groups based on one or more grouping variables: use aggregate

The reshape package can also handle these tasks

Use tapply to process the values corresponding to each group

Combining split with sapply or lapply is a good solution when other methods don't provide the flexibility you need

args(sapply)

## function (X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)

## NULL

args(lapply)

## function (X, FUN, ...)

## NULL

args(apply)

## function (X, MARGIN, FUN, ...)

## NULL

args(aggregate)

## function (x, ...)

## NULL

args(tapply)

## function (X, INDEX, FUN = NULL, ..., simplify = TRUE)

## NULL
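The road map can be illustrated on toy data. In this sketch (all values invented), tapply and split plus sapply give the same group means, apply works along the columns of a matrix, and lapply keeps its result as a list.

```r
# Toy grouped data
df <- data.frame(group = c("a", "b", "a", "b", "c"),
                 value = c(1, 2, 3, 4, 5))

by_tapply <- tapply(df$value, df$group, mean)         # mean per group
by_split  <- sapply(split(df$value, df$group), mean)  # same, via split

m <- matrix(1:6, nrow = 2)
col_sums <- apply(m, 2, sum)   # MARGIN = 2 applies sum to each column

per_elem <- lapply(list(a = 1:3, b = 4:6), sum)  # always returns a list

print(by_tapply)
print(by_split)
print(col_sums)
print(per_elem)
```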

8.3 Mapping a Function to a Vector or List

lapply and sapply

lapply and sapply can be used as an alternative to loops for performing repetitive tasks.

8.4 Mapping a function to a matrix or array

apply function

One common use of apply is in conjunction with functions like scale, which require summary statistics calculated for each column of a matrix.

sstate = scale(state.x77,center=apply(state.x77,2,median), scale=apply(state.x77,2,mad))

summfn = function(x)c(n=sum(!is.na(x)),mean=mean(x),sd=sd(x))

x = apply(state.x77,2,summfn)

t(x)

Example 2:

x = 1:12

apply(matrix(x,ncol=3,byrow=TRUE),1,sum)

8.5 Mapping a Function Based on Groups

To calculate scalar data summaries of one or more columns of a data frame or matrix, the aggregate function can be used.

do.call constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
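A common use of do.call, sketched here with invented values, is to bind a list of per-group results into one matrix without knowing in advance how many pieces the list holds.

```r
# Invented per-group summaries, e.g. as returned by lapply
pieces <- list(c(n = 2, mean = 1.5),
               c(n = 3, mean = 4))

# Equivalent to rbind(pieces[[1]], pieces[[2]]), for any list length
combined <- do.call(rbind, pieces)

# The function can also be given by name
total <- do.call("sum", list(1, 2, 3))

print(combined)
print(total)
```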

8.6 The reshape Package

see package help

8.7 Loops in R

system.time reports the time taken to evaluate an expression

Chapter 9 Reshaping Data

9.1 Modifying Data Frame Variables

9.2 Recoding Variables

9.3 The recode Function

9.4 Reshaping Data Frames

The stack function reorganizes a data set with one column per group into a long form, with one column of values and a second column identifying the group

The unstack function reorganizes stacked data back to the one-column-per-group form.
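A minimal round trip through stack and unstack, using a tiny invented data frame with one column per group.

```r
# One column per group (wide form)
wide <- data.frame(g1 = c(1, 2), g2 = c(3, 4))

long <- stack(wide)    # columns: values, and ind (the source column)
back <- unstack(long)  # back to one column per group

print(long)
print(back)
```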

args(reshape)

## function (data, varying = NULL, v.names = NULL, timevar = "time",

## idvar = "id", ids = 1L:NROW(data), times = seq_along(varying[[1L]]),

## drop = NULL, direction, new.row.names = NULL, sep = ".",

## split = if (sep == "") {

## list(regexp = "[A-Za-z][0-9]", include = TRUE)

## } else {

## list(regexp = sep, include = FALSE, fixed = TRUE)

## })

## NULL

9.5 The reshape Package

melt function

9.6 Combining Data Frames

cbind binds data frames by column

rbind binds data frames by row

An easy way to check which variables two data frames have in common is to pass their names to the intersect function

merge function

merge(x, y, by = intersect(names(x), names(y)),

by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,

sort = TRUE, suffixes = c(".x",".y"),

incomparables = NULL, ...)

9.7 Under the Hood of merge

match(x, table, nomatch = NA_integer_, incomparables = NULL)
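To connect merge with match, here is a sketch on two small invented data frames that share an id column. By default merge keeps only matching rows, all = TRUE keeps every row, and match shows the row lookup that underlies the join.

```r
# Two invented data frames sharing the column "id"
x <- data.frame(id = c(1, 2, 3), a = c("p", "q", "r"))
y <- data.frame(id = c(2, 3, 4), b = c(10, 20, 30))

common <- intersect(names(x), names(y))  # the default `by` columns

inner <- merge(x, y)              # only ids present in both
outer <- merge(x, y, all = TRUE)  # all ids; gaps filled with NA

pos <- match(x$id, y$id)  # row of y matching each row of x, or NA

print(common)
print(inner)
print(outer)
print(pos)
```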

end of document
