What is dplyr?
dplyr is a powerful R package for transforming and summarizing data. It provides a set of functions (or “verbs”) that perform common data manipulation operations such as filtering rows, selecting specific columns, re-ordering rows, adding new columns, and summarizing data.
In addition, dplyr provides group_by(), which supports another common task: the “split-apply-combine” approach to analysis.
Important dplyr Verbs
The dplyr package gives you a handful of useful verbs for managing data. On their own, they don’t do anything that base R can’t do. Here are some of the single-table verbs we’ll be working with in this lesson (single-table meaning that they only work on a single table, in contrast to the two-table verbs used for joining data together, which we’ll cover in a later lesson).
- select() : select columns
- filter() : filter rows
- arrange() : re-order or arrange rows
- mutate() : create new columns
- The %>% operator allows for piping.
- summarise() : summarise values
- group_by() : allows for group operations in the “split-apply-combine” concept
Before we can use dplyr, we need to install it once with install.packages("dplyr") and then load it with library(dplyr).
> install.packages("dplyr")
> library(dplyr)
select()
The select() function returns only certain columns. The first argument is the data, and subsequent arguments are the columns you want.
For this example we are going to use R’s built-in dataset called iris.
# To select only one column from the data frame
> select(iris, Species)
# To select two or more columns (here the first three, by position)
> select(iris, c(1:3))
Notice above how the original data doesn’t change. We’re selecting out only certain columns of interest and throwing away columns we don’t care about. If we wanted to keep this data, we would need to assign the result of the select() operation to a new object. Let’s make a new object called obj that contains only the Species column. Notice again how the original data is unchanged.
> obj <- select(iris, Species)
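Columns can be picked in a few other ways too. The following is a small sketch, not an exhaustive list: it selects by name, drops a column with a minus sign, and uses the starts_with() helper. The object names (two_cols, no_species, petal_cols) are made up for illustration.

```r
library(dplyr)

# Select columns by name (no quotes needed)
two_cols <- select(iris, Sepal.Length, Species)

# Drop a column by prefixing it with a minus sign
no_species <- select(iris, -Species)

# Select every column whose name starts with "Petal"
petal_cols <- select(iris, starts_with("Petal"))

names(petal_cols)
```

In each case iris itself still has all five of its columns.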
filter()
If you want to filter rows of the data where some condition is true, use the filter() function.
- The first argument is the data frame you want to filter, e.g. filter(mydata, ...).
- The second argument is a condition the rows must satisfy, e.g. filter(mydata, symbol == "LEU1").
If you want rows that satisfy all of several conditions, use the “and” operator, &. The “or” operator, | (the pipe character, usually shift-backslash), returns the rows that meet any of the conditions.
- == : Equal to
- != : Not equal to
- >, >= : Greater than, greater than or equal to
- <, <= : Less than, less than or equal to
Let’s try it out. For this to work you have to have already loaded the dplyr package.
# To filter the data for only the setosa species
> filter(iris, Species == "setosa")
# To filter where Sepal.Length is less than 4.5
> filter(iris, Sepal.Length < 4.5)
# To filter for two species
> filter(iris, Species == "setosa" | Species == "virginica")
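The examples above use a single condition or the “or” operator. As a rough sketch of the “and” case, the block below combines two conditions with &, shows that separating conditions with commas behaves the same way, and uses %in% as a shorter way to write several “or” comparisons. The cutoff of 5 and the object names are arbitrary choices for illustration.

```r
library(dplyr)

# "and": both conditions must hold for a row to be kept
big_setosa <- filter(iris, Species == "setosa" & Sepal.Length > 5)

# Conditions separated by commas are combined with "and" as well
big_setosa2 <- filter(iris, Species == "setosa", Sepal.Length > 5)

# %in% is a compact alternative to chaining == with |
two_species <- filter(iris, Species %in% c("setosa", "virginica"))

nrow(two_species)
```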
mutate()
The mutate() function adds new columns to the data. Remember, it doesn’t actually modify the data frame you’re operating on, and the result is transient unless you assign it to a new object or reassign it back to itself (generally, not always a good practice).
mutate() has a nice little feature too: columns are created in order, so you can add one variable and then continue mutating to add more variables based on that variable.
Let’s try it out on our iris dataset.
# Let's add a new column for the area of the sepal
> df <- mutate(iris, sepal.area = Sepal.Length * Sepal.Width)
# Let's add a little margin to the area
> df <- mutate(df, sepal.area = sepal.area
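To make the “build on a column you just created” idea concrete, here is a minimal sketch that does both steps in a single mutate() call. The margin is not specified in this post, so margin = 0.5 and the column name sepal.area.padded are hypothetical placeholders chosen only for illustration.

```r
library(dplyr)

margin <- 0.5  # hypothetical value, used only for this illustration

df <- mutate(
  iris,
  sepal.area = Sepal.Length * Sepal.Width,  # new column
  sepal.area.padded = sepal.area + margin   # refers to the column created just above
)

head(df, 3)
```

Keeping the padded value in its own column, rather than overwriting sepal.area, makes it easier to check both numbers side by side.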
arrange()
The arrange() function does what it sounds like. It takes a data frame or tbl and arranges (or sorts) by column(s) of interest. The first argument is the data, and subsequent arguments are columns to sort on. Use the desc() function to sort in descending order.
# Arrange data on Sepal.Length
# By default it will arrange in ascending order
> df <- arrange(iris, Sepal.Length)
# Arrange data in descending order
> df <- arrange(iris, desc(Sepal.Length))
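Since the arguments after the data are all columns to sort on, you can pass more than one. The sketch below sorts by Species first and then, within each species, by Sepal.Length from largest to smallest. The object name sorted is just an example.

```r
library(dplyr)

# Sort by Species (factor level order), then by Sepal.Length descending within each species
sorted <- arrange(iris, Species, desc(Sepal.Length))

head(sorted, 3)
```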
summarize()
The summarize() function summarizes multiple values to a single value. On its own, the summarize() function doesn’t seem to be all that useful. The dplyr package provides a few convenience functions, such as n() and n_distinct(), that tell you the number of observations or the number of distinct values of a particular variable.
# Get the mean Sepal.Length for iris
> summarize(iris, mean(Sepal.Length))
# Use a more friendly name, e.g., Sepal.mean, or whatever you want to call it.
> summarize(iris, Sepal.mean=mean(Sepal.Length))
# Measure the correlation between sepal width and sepal length
> summarize(iris, r=cor(Sepal.Width, Sepal.Length))
# Get the number of observations
> summarize(iris, n())
# The number of distinct Species in the data
> summarize(iris, n_distinct(Species))
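The calls above each compute one value, but a single summarize() call can return several named summaries at once. Here is a small sketch; the column names (n_obs, n_species, mean_length, sd_length) are arbitrary labels chosen for this example.

```r
library(dplyr)

overview <- summarize(
  iris,
  n_obs = n(),                       # number of rows
  n_species = n_distinct(Species),   # number of distinct species
  mean_length = mean(Sepal.Length),
  sd_length = sd(Sepal.Length)
)

overview
```

The result is a one-row data frame with one column per summary.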
group_by()
We saw that summarize() isn’t that useful on its own. Neither is group_by(). All it does is take an existing data frame and convert it into a grouped data frame where operations are performed by group.
# Let's group the iris data by Species
> df <- group_by(iris, Species)
> df
# A tibble: 150 x 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ... with 140 more rows
The real power comes in when group_by() and summarize() are used together. First, write the group_by() statement. Then wrap the result of that with a call to summarize().
# Let's get the mean Sepal.Length for each Species
> summarize(group_by(iris, Species), mean(Sepal.Length))
# A tibble: 3 x 2
Species `mean(Sepal.Length)`
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
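The same pattern works with named summaries, which gives friendlier column headers than `mean(Sepal.Length)`. A minimal sketch, with n and mean_length as illustrative column names:

```r
library(dplyr)

by_species <- summarize(
  group_by(iris, Species),
  n = n(),                           # rows per species
  mean_length = mean(Sepal.Length)   # mean sepal length per species
)

by_species
```

This is the split-apply-combine idea in action: group_by() splits the rows by Species, summarize() applies the functions to each group, and the results are combined into one row per group.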
The pipe: %>%
How %>% works
This is where things get awesome. The dplyr package imports functionality from the magrittr package that lets you pipe the output of one function to the input of another, so you can avoid nesting functions. The pipe looks like this: %>%. You don’t have to load the magrittr package to use it, since dplyr makes the pipe available when you load the dplyr package.
Here’s the simplest way to use it. Consider the tail() function. It expects a data frame as its first argument, and the next argument is the number of rows to print. These two commands are equivalent:
> tail(iris, 5)
# or
> iris %>% tail(5)
Let’s use one of the dplyr verbs. Again, both commands give the same result.
> filter(iris, Species=="virginica")
> iris %>% filter(Species=="virginica")
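The pipe pays off most when several verbs are chained. As a sketch, the pipeline below filters, groups, summarizes, and sorts in one readable sequence, and the nested version underneath does the same work from the inside out. The cutoff of 5 and the names result, nested, and mean_width are arbitrary choices for this example.

```r
library(dplyr)

# Piped version: read top to bottom
result <- iris %>%
  filter(Sepal.Length > 5) %>%
  group_by(Species) %>%
  summarize(n = n(), mean_width = mean(Sepal.Width)) %>%
  arrange(desc(mean_width))

# Equivalent nested version: read from the innermost call outward
nested <- arrange(
  summarize(
    group_by(filter(iris, Sepal.Length > 5), Species),
    n = n(), mean_width = mean(Sepal.Width)
  ),
  desc(mean_width)
)

result
```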
Conclusion
In this post we looked at some of the most common dplyr verbs, each with an example.
They may seem a bit overwhelming at first, but with just a little practice you will soon master the most useful components of dplyr. You may find that every R script or notebook you write is better with dplyr: your data processing code will be more concise and easier to understand, and it will usually take less time to write.
So, the next time you want to perform data manipulation in R, dplyr is the way to go!
This brings us to the end of this blog. We really appreciate your time and hope you liked it.
Do visit our page www.zigya.com/blog for more informative blogs on data science.
Keep reading! Cheers!
Zigya Academy