The dplyr is an important R-package used in transforming and summarizing tabular data in R.It can be used to perform major common data manipulation functions such as selection of specific columns, filtering and rearranging rows, adding new columns and grouping rows and columns.
Let’s start by loading the dplyr package into Rstudio
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
For this tutorial, we will be using the College Scorecard project dataset. The College Scorecard project provides data to help students and families compare college costs and outcomes as they weigh the tradeoffs of different colleges, accounting for their own needs and educational goals. You can click here to download the College Scorecard project data
We load the data into R
college_scorecard <- readRDS("C:/Users/factual/college_scorecard.rds")
In this tutorial, we will be exploring the following dplyr functions.
arrange() for rearranging or reordering rows
filter() for filtering rows
select() for selecting columns
mutate() for creating new columns
group_by() for goruping operations
summarise() for summarising values
Let’s select the name, size and grad_students (number of graduate students in each institution) from the dataset
grad_pop<-select(college_scorecard,name, size, grad_students)
head(grad_pop)
## name size grad_students
## 1 Alabama A & M University 4081 760
## 2 University of Alabama at Birmingham 11014 6708
## 3 Amridge University 338 365
## 4 University of Alabama in Huntsville 5703 1754
## 5 Alabama State University 5123 686
## 6 The University of Alabama 27364 5477
We can also select all the columns except a specific column by using negative indexing “-”
grad_pop2<- select(grad_pop, -name )
head(grad_pop2)
## size grad_students
## 1 4081 760
## 2 11014 6708
## 3 338 365
## 4 5703 1754
## 5 5123 686
## 6 27364 5477
To select all the columns except a specific group of columns
grad_pop3<- select(grad_pop, -c(name,size) )
head(grad_pop3)
## grad_students
## 1 760
## 2 6708
## 3 365
## 4 1754
## 5 686
## 6 5477
To select a range of columns, we can use the colon (:) operator
range_select<-select(college_scorecard, part_time_share:avg_net_price.private )
head(range_select)
## part_time_share operating avg_net_price.public avg_net_price.private
## 1 0.0762 NA 11726 NA
## 2 0.2533 NA 15123 NA
## 3 0.3728 NA NA 6927
## 4 0.2271 NA 11160 NA
## 5 0.0859 NA 11029 NA
## 6 0.0824 NA 18526 NA
We can use the function starts_with() to select all columns that start with any character string, e.g “avg”
chr_select<-select(college_scorecard, starts_with("avg"))
head(chr_select)
## avg_net_price.public avg_net_price.private
## 1 11726 NA
## 2 15123 NA
## 3 NA 6927
## 4 11160 NA
## 5 11029 NA
## 6 18526 NA
We can also use the function ends_with() to select all columns that end with any character string, e.g “ic”
chr_select2<-select(college_scorecard, ends_with("ic"))
head(chr_select2)
## minority_serving.hispanic demographics.race_ethnicity.hispanic
## 1 NA 0.0037
## 2 NA 0.0246
## 3 NA 0.0089
## 4 NA 0.0344
## 5 NA 0.0102
## 6 NA 0.0282
## avg_net_price.public
## 1 11726
## 2 15123
## 3 NA
## 4 11160
## 5 11029
## 6 18526
Let’s filter the rows for Public schools in the Year 2014 Academic Session
pub_sch <- filter(college_scorecard,academic_year== 2014&ownership =="Public")
head(pub_sch[,8:13])
## price_calculator_url under_investigation
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 <NA> <NA>
## degrees_awarded.predominant degrees_awarded.highest
## 1 Predominantly bachelor's-degree granting Graduate degree
## 2 Predominantly bachelor's-degree granting Graduate degree
## 3 Predominantly bachelor's-degree granting Graduate degree
## 4 Predominantly bachelor's-degree granting Graduate degree
## 5 Predominantly bachelor's-degree granting Graduate degree
## 6 Predominantly associate's-degree granting Associate degree
## ownership locale
## 1 Public <NA>
## 2 Public <NA>
## 3 Public <NA>
## 4 Public <NA>
## 5 Public <NA>
## 6 Public <NA>
We may be interested in filtering the rows for private nonprofit and public schools in Year 2014 academic session.
pub_pri <- filter(college_scorecard,academic_year== 2014& ownership %in% c("Public", "Private nonprofit"))
head(pub_pri[,8:13])
## price_calculator_url under_investigation
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 <NA> <NA>
## 6 <NA> <NA>
## degrees_awarded.predominant degrees_awarded.highest
## 1 Predominantly bachelor's-degree granting Graduate degree
## 2 Predominantly bachelor's-degree granting Graduate degree
## 3 Predominantly bachelor's-degree granting Graduate degree
## 4 Predominantly bachelor's-degree granting Graduate degree
## 5 Predominantly bachelor's-degree granting Graduate degree
## 6 Predominantly bachelor's-degree granting Graduate degree
## ownership locale
## 1 Public <NA>
## 2 Public <NA>
## 3 Private nonprofit <NA>
## 4 Public <NA>
## 5 Public <NA>
## 6 Public <NA>
The knowledge of the pipe (%>%) operator is needed for the Arrange() function. Let’s talk a bit about the pipe (%>%) operator before discussing the Arrange() function. The pipe(%>%) operator makes it easy to run a complex code in just few lines.It allows sequencing of your data operations from left to right, as apposed to from inside and out. Take a look at the following example.
college_scorecard %>%select(size, grad_students) %>% head
## size grad_students
## 1 4081 760
## 2 11014 6708
## 3 338 365
## 4 5703 1754
## 5 5123 686
## 6 27364 5477
What the codes above implies is, from the college_scorecard data, select size and grad_students then show the head of the data. For easy understanding, you can interprete the %>% operator as “then”
Now that we know what the pipe operator does,lets discuss the Arrange() function
What the arrange() does is to reorder the rows by given column (variable). For instance, we want to rearrange the rows by school ownership and then select certain columns
college_scorecard %>% arrange(ownership) %>% select(size,grad_students,name,ownership)%>%head()
## size grad_students name
## 1 704 78 South University-Montgomery
## 2 108 NA New Beginning College of Cosmetology
## 3 295 NA Herzing University-Birmingham
## 4 122 NA Prince Institute-Southeast
## 5 1259 NA Charter College
## 6 228 NA Alaska Career College
## ownership
## 1 Private for-profit
## 2 Private for-profit
## 3 Private for-profit
## 4 Private for-profit
## 5 Private for-profit
## 6 Private for-profit
We can also rerrange the rows by school size and then filter the rows for South University-Montgomery school.
college_scorecard %>% arrange(size) %>% filter(name=="South University-Montgomery")%>% select(size,grad_students,name,ownership,city,state)%>%head()
## size grad_students name ownership
## 1 394 123 South University-Montgomery Private for-profit
## 2 460 121 South University-Montgomery Private for-profit
## 3 522 101 South University-Montgomery Private for-profit
## 4 599 83 South University-Montgomery Private for-profit
## 5 704 78 South University-Montgomery Private for-profit
## city state
## 1 Montgomery AL
## 2 Montgomery AL
## 3 Montgomery AL
## 4 Montgomery AL
## 5 Montgomery AL
We can create new column using the mutate() function. Lets create a new variable called “prop” which is the ratio of school size to grad_students
college_scorecard%>% mutate(prop=size/grad_students)%>%select(size,grad_students,name,ownership,city,prop)%>%head()
## size grad_students name
## 1 4081 760 Alabama A & M University
## 2 11014 6708 University of Alabama at Birmingham
## 3 338 365 Amridge University
## 4 5703 1754 University of Alabama in Huntsville
## 5 5123 686 Alabama State University
## 6 27364 5477 The University of Alabama
## ownership city prop
## 1 Public Normal 5.3697368
## 2 Public Birmingham 1.6419201
## 3 Private nonprofit Montgomery 0.9260274
## 4 Public Huntsville 3.2514253
## 5 Public Montgomery 7.4679300
## 6 Public Tuscaloosa 4.9961658
The Summarise() is very useful in creating summary statistics(min,max , mean and total) for a particular variable. Let’s obtain the minimum, maximum, mean and the total for the size. Before doing this, we removed the missing values(NA) in the dataset so as to get the summary statistics.
college_scorecard1<-select(college_scorecard, size)
col<-na.omit(college_scorecard1)
col %>%
summarise(avg_size = mean(size),
min_size = min(size),
max_size = max(size),
total = n())
## avg_size min_size max_size total
## 1 2356.144 0 205286 34474
The group_by() function groups a dataframe based on certain fields and makes summarise function interesting.Let’s obtain the mean for in state tuition fees for each state.
college_scorecard%>%group_by(state)%>%dplyr::summarise_at(vars(tuition.in_state), funs(mean(., na.rm=TRUE)))
## # A tibble: 59 x 2
## state tuition.in_state
## <fct> <dbl>
## 1 AK 8481.
## 2 AL 10105.
## 3 AR 8366.
## 4 AS 3520
## 5 AZ 10572.
## 6 CA 14087.
## 7 CO 12105.
## 8 CT 18169.
## 9 DC 22861.
## 10 DE 12431.
## # ... with 49 more rows