Understanding the Dplyr Package in R

By Hammed Abdulsalam | October 23, 2018

The dplyr is an important R-package used in transforming and summarizing tabular data in R.It can be used to perform major common data manipulation functions such as selection of specific columns, filtering and rearranging rows, adding new columns and grouping rows and columns.

Getting Started with dplyr

Let’s start by loading the dplyr package into Rstudio

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

For this tutorial, we will be using the College Scorecard project dataset. The College Scorecard project provides data to help students and families compare college costs and outcomes as they weigh the tradeoffs of different colleges, accounting for their own needs and educational goals. You can click here to download the College Scorecard project data

We load the data into R

college_scorecard <- readRDS("C:/Users/factual/college_scorecard.rds")

In this tutorial, we will be exploring the following dplyr functions.

arrange() for rearranging or reordering rows

filter() for filtering rows

select() for selecting columns

mutate() for creating new columns

group_by() for goruping operations

summarise() for summarising values

Using the select() function

Let’s select the name, size and grad_students (number of graduate students in each institution) from the dataset

grad_pop<-select(college_scorecard,name, size, grad_students)
head(grad_pop)
##                                  name  size grad_students
## 1            Alabama A & M University  4081           760
## 2 University of Alabama at Birmingham 11014          6708
## 3                  Amridge University   338           365
## 4 University of Alabama in Huntsville  5703          1754
## 5            Alabama State University  5123           686
## 6           The University of Alabama 27364          5477

We can also select all the columns except a specific column by using negative indexing “-”

grad_pop2<- select(grad_pop, -name )
head(grad_pop2)
##    size grad_students
## 1  4081           760
## 2 11014          6708
## 3   338           365
## 4  5703          1754
## 5  5123           686
## 6 27364          5477

To select all the columns except a specific group of columns

grad_pop3<- select(grad_pop, -c(name,size) )
head(grad_pop3)
##   grad_students
## 1           760
## 2          6708
## 3           365
## 4          1754
## 5           686
## 6          5477

To select a range of columns, we can use the colon (:) operator

range_select<-select(college_scorecard, part_time_share:avg_net_price.private  )
head(range_select)
##   part_time_share operating avg_net_price.public avg_net_price.private
## 1          0.0762        NA                11726                    NA
## 2          0.2533        NA                15123                    NA
## 3          0.3728        NA                   NA                  6927
## 4          0.2271        NA                11160                    NA
## 5          0.0859        NA                11029                    NA
## 6          0.0824        NA                18526                    NA

We can use the function starts_with() to select all columns that start with any character string, e.g “avg”

chr_select<-select(college_scorecard, starts_with("avg"))
head(chr_select)
##   avg_net_price.public avg_net_price.private
## 1                11726                    NA
## 2                15123                    NA
## 3                   NA                  6927
## 4                11160                    NA
## 5                11029                    NA
## 6                18526                    NA

We can also use the function ends_with() to select all columns that end with any character string, e.g “ic”

chr_select2<-select(college_scorecard, ends_with("ic"))
head(chr_select2)
##   minority_serving.hispanic demographics.race_ethnicity.hispanic
## 1                        NA                               0.0037
## 2                        NA                               0.0246
## 3                        NA                               0.0089
## 4                        NA                               0.0344
## 5                        NA                               0.0102
## 6                        NA                               0.0282
##   avg_net_price.public
## 1                11726
## 2                15123
## 3                   NA
## 4                11160
## 5                11029
## 6                18526

Using the filter() function

Let’s filter the rows for Public schools in the Year 2014 Academic Session

pub_sch <- filter(college_scorecard,academic_year== 2014&ownership =="Public")
head(pub_sch[,8:13])
##   price_calculator_url under_investigation
## 1                 <NA>                <NA>
## 2                 <NA>                <NA>
## 3                 <NA>                <NA>
## 4                 <NA>                <NA>
## 5                 <NA>                <NA>
## 6                 <NA>                <NA>
##                 degrees_awarded.predominant degrees_awarded.highest
## 1  Predominantly bachelor's-degree granting         Graduate degree
## 2  Predominantly bachelor's-degree granting         Graduate degree
## 3  Predominantly bachelor's-degree granting         Graduate degree
## 4  Predominantly bachelor's-degree granting         Graduate degree
## 5  Predominantly bachelor's-degree granting         Graduate degree
## 6 Predominantly associate's-degree granting        Associate degree
##   ownership locale
## 1    Public   <NA>
## 2    Public   <NA>
## 3    Public   <NA>
## 4    Public   <NA>
## 5    Public   <NA>
## 6    Public   <NA>

We may be interested in filtering the rows for private nonprofit and public schools in Year 2014 academic session.

pub_pri <- filter(college_scorecard,academic_year== 2014& ownership %in% c("Public", "Private nonprofit"))
head(pub_pri[,8:13])
##   price_calculator_url under_investigation
## 1                 <NA>                <NA>
## 2                 <NA>                <NA>
## 3                 <NA>                <NA>
## 4                 <NA>                <NA>
## 5                 <NA>                <NA>
## 6                 <NA>                <NA>
##                degrees_awarded.predominant degrees_awarded.highest
## 1 Predominantly bachelor's-degree granting         Graduate degree
## 2 Predominantly bachelor's-degree granting         Graduate degree
## 3 Predominantly bachelor's-degree granting         Graduate degree
## 4 Predominantly bachelor's-degree granting         Graduate degree
## 5 Predominantly bachelor's-degree granting         Graduate degree
## 6 Predominantly bachelor's-degree granting         Graduate degree
##           ownership locale
## 1            Public   <NA>
## 2            Public   <NA>
## 3 Private nonprofit   <NA>
## 4            Public   <NA>
## 5            Public   <NA>
## 6            Public   <NA>

Using the Arrange() function

The knowledge of the pipe (%>%) operator is needed for the Arrange() function. Let’s talk a bit about the pipe (%>%) operator before discussing the Arrange() function. The pipe(%>%) operator makes it easy to run a complex code in just few lines.It allows sequencing of your data operations from left to right, as apposed to from inside and out. Take a look at the following example.

college_scorecard %>%select(size, grad_students) %>% head
##    size grad_students
## 1  4081           760
## 2 11014          6708
## 3   338           365
## 4  5703          1754
## 5  5123           686
## 6 27364          5477

What the codes above implies is, from the college_scorecard data, select size and grad_students then show the head of the data. For easy understanding, you can interprete the %>% operator as “then”

Now that we know what the pipe operator does,lets discuss the Arrange() function

What the arrange() does is to reorder the rows by given column (variable). For instance, we want to rearrange the rows by school ownership and then select certain columns

college_scorecard %>% arrange(ownership) %>% select(size,grad_students,name,ownership)%>%head()
##   size grad_students                                 name
## 1  704            78          South University-Montgomery
## 2  108            NA New Beginning College of Cosmetology
## 3  295            NA        Herzing University-Birmingham
## 4  122            NA           Prince Institute-Southeast
## 5 1259            NA                      Charter College
## 6  228            NA                Alaska Career College
##            ownership
## 1 Private for-profit
## 2 Private for-profit
## 3 Private for-profit
## 4 Private for-profit
## 5 Private for-profit
## 6 Private for-profit

We can also rerrange the rows by school size and then filter the rows for South University-Montgomery school.

college_scorecard %>% arrange(size) %>% filter(name=="South University-Montgomery")%>% select(size,grad_students,name,ownership,city,state)%>%head()
##   size grad_students                        name          ownership
## 1  394           123 South University-Montgomery Private for-profit
## 2  460           121 South University-Montgomery Private for-profit
## 3  522           101 South University-Montgomery Private for-profit
## 4  599            83 South University-Montgomery Private for-profit
## 5  704            78 South University-Montgomery Private for-profit
##         city state
## 1 Montgomery    AL
## 2 Montgomery    AL
## 3 Montgomery    AL
## 4 Montgomery    AL
## 5 Montgomery    AL

Create new columns using mutate()

We can create new column using the mutate() function. Lets create a new variable called “prop” which is the ratio of school size to grad_students

college_scorecard%>% mutate(prop=size/grad_students)%>%select(size,grad_students,name,ownership,city,prop)%>%head()
##    size grad_students                                name
## 1  4081           760            Alabama A & M University
## 2 11014          6708 University of Alabama at Birmingham
## 3   338           365                  Amridge University
## 4  5703          1754 University of Alabama in Huntsville
## 5  5123           686            Alabama State University
## 6 27364          5477           The University of Alabama
##           ownership       city      prop
## 1            Public     Normal 5.3697368
## 2            Public Birmingham 1.6419201
## 3 Private nonprofit Montgomery 0.9260274
## 4            Public Huntsville 3.2514253
## 5            Public Montgomery 7.4679300
## 6            Public Tuscaloosa 4.9961658

Using the Summarise() function

The Summarise() is very useful in creating summary statistics(min,max , mean and total) for a particular variable. Let’s obtain the minimum, maximum, mean and the total for the size. Before doing this, we removed the missing values(NA) in the dataset so as to get the summary statistics.

college_scorecard1<-select(college_scorecard, size)
col<-na.omit(college_scorecard1)
col %>% 
    summarise(avg_size = mean(size), 
              min_size = min(size),
              max_size = max(size),
              total = n())
##   avg_size min_size max_size total
## 1 2356.144        0   205286 34474

Using the group_by() function

The group_by() function groups a dataframe based on certain fields and makes summarise function interesting.Let’s obtain the mean for in state tuition fees for each state.

college_scorecard%>%group_by(state)%>%dplyr::summarise_at(vars(tuition.in_state), funs(mean(., na.rm=TRUE)))
## # A tibble: 59 x 2
##    state tuition.in_state
##    <fct>            <dbl>
##  1 AK               8481.
##  2 AL              10105.
##  3 AR               8366.
##  4 AS               3520 
##  5 AZ              10572.
##  6 CA              14087.
##  7 CO              12105.
##  8 CT              18169.
##  9 DC              22861.
## 10 DE              12431.
## # ... with 49 more rows
comments powered by Disqus