Introducing Data Science with R, week 6 2019-10-17
Now that we know basic R syntax, we move on to data wrangling/transformation, which is also how we come to understand the data.
The first thing you get is the data, but non-tidy data will waste your life.
Happy people are all alike; unhappy people are each unhappy in their own way. “Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
So what does tidy data mean?
Three principles
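As a reminder, the three principles are usually stated as: each variable has its own column, each observation has its own row, and each value has its own cell. The sketch below is only an illustration under assumptions: the `messy` table and its values are made up, and `pivot_longer()` from the tidyr package (not otherwise used in these notes) is one possible way to reshape it.

```r
library(tidyr)

# Hypothetical messy table: the variable "year" is hidden in the column names
messy <- data.frame(
  country = c("A", "B", "C"),
  `2018`  = c(10, 20, 30),
  `2019`  = c(11, 21, 31),
  check.names = FALSE
)

# Tidy version: one row per (country, year) observation,
# with year and cases each in their own column
tidy <- pivot_longer(messy, cols = -country,
                     names_to = "year", values_to = "cases")
tidy
```

After reshaping, every dplyr verb in the rest of these notes can address `year` and `cases` by name.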
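The development team has an overall framework, the tidyverse, which aims to build an ecosystem of data science packages that share a philosophy, a grammar, and data structures for handling data. The idea has even sparked a debate over how R should be taught (R base first vs Tidyverse first).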
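For now we only need to get a grip on two core packages.
dplyr: a Grammar of Data Manipulation
ggplot2: a Grammar of Graphics
DataCamp offers the course Introduction to the Tidyverse, which takes about four hours; please find time to preview it and practice.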
dplyr
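The five basic verbs of data manipulation: filter(), select(), arrange(), mutate(), summarise(). Note: columns are variables (VAR), rows are observations (OBS).
filter(): keeps the OBS that satisfy the given logical conditions, similar to subset().
select(): picks VAR (columns) by name.
arrange(): sorts the OBS by the given VAR, in the order the VAR are listed. Similar to order().
mutate(): computes on existing VAR and adds the results as new VAR. Similar to transform().
summarise(): applies other functions to a data frame to summarise it, reducing each variable to a single value.
For every verb, the first argument is a data frame, and subsequent arguments say what to do with that data frame.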
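A minimal sketch of the five verbs side by side, assuming the built-in `mtcars` data set instead of the flight data used below; the column name `hp_per_wt` is made up for illustration. Note that each call takes the data frame as its first argument.

```r
library(dplyr)

# mtcars ships with R: rows are car models (OBS), columns are measurements (VAR)
cars <- as_tibble(mtcars)

f <- filter(cars, cyl == 4)               # keep rows where cyl equals 4
s <- select(cars, mpg, cyl, wt)           # keep three columns
a <- arrange(cars, mpg)                   # sort rows by mpg, ascending
m <- mutate(cars, hp_per_wt = hp / wt)    # add a new column
u <- summarise(cars, mean_mpg = mean(mpg))  # collapse to one row
```

Every verb returns a new data frame and leaves `cars` unchanged, which is what makes the chaining style shown later possible.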
# load packages
suppressMessages(library(dplyr))
# install.packages(c("hflights", "Lahman"))
library(hflights)
# explore data: flights departing from two Houston airports in 2011
data(hflights)
head(hflights,5)
as_tibble() creates a "local data frame" (a tibble), which is simply a wrapper for a data frame that prints nicely. The older tbl_df() does the same job but is deprecated in current dplyr.
# convert to local data frame
flights <- as_tibble(hflights)
# alternative: head(data.frame(flights),3)
# printing only shows 10 rows and as many columns as can fit on your screen
flights
# you can specify that you want to see more rows
#print(flights, n=20)
# convert to a normal data frame to see all of the columns
#data.frame(head(flights))
filter (keep rows matching criteria): filter observations
filter(df, <logical conditions>) returns the rows that satisfy the conditions.
# base R approach to view all flights on January 1
#flights[flights$Month==1 & flights$DayofMonth==1, ]
# dplyr approach
# note: you can use comma or ampersand to represent AND condition
filter(flights, Month==1 & DayofMonth==1)
# use the vertical bar `|` for an OR condition
filter(flights, UniqueCarrier=="AA" | UniqueCarrier=="UA")
# you can also use %in% operator
#filter(flights, UniqueCarrier %in% c("AA", "UA"))
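One detail the examples above do not show is how filter() treats missing values. The sketch below uses a made-up `delays` table (not taken from hflights) to illustrate that rows where the condition evaluates to NA are dropped along with the FALSE ones.

```r
library(dplyr)

# Hypothetical toy data, not taken from hflights
delays <- tibble(
  carrier   = c("AA", "UA", "AA", "WN"),
  dep_delay = c(5, NA, 70, 90)
)

# Rows where the condition is NA are dropped, just like rows where it is FALSE
late <- filter(delays, dep_delay > 60)

# Keep missing values explicitly when they matter
late_or_missing <- filter(delays, dep_delay > 60 | is.na(dep_delay))

# A comma between conditions means AND
aa_late <- filter(delays, carrier == "AA", dep_delay > 60)
```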
select: Pick columns by name (select variables)
# base R approach to select DepTime, ArrTime, and FlightNum columns
#flights[, c("DepTime", "ArrTime", "FlightNum")]
# dplyr approach
select(flights, DepTime, ArrTime, FlightNum)
# use colon to select multiple contiguous columns, and use `contains` to match columns by name
# note: `starts_with`, `ends_with`, and `matches` (for regular expressions) can also be used to match columns by name
# or use `-` to exclude a column
select(flights, Year:DayofMonth, contains("Taxi"), contains("Delay"))
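The comments above mention `-`, `starts_with`, `ends_with`, and `matches` without showing them. Here is a small sketch on a made-up `toy` table that borrows hflights-style column names; the values are invented.

```r
library(dplyr)

# Hypothetical toy data with hflights-style column names
toy <- tibble(
  Year = 2011, Month = 1:3, DayofMonth = c(1, 15, 30),
  DepDelay = c(0, 12, -3), ArrDelay = c(5, 20, -1),
  TaxiIn = c(7, 4, 6), TaxiOut = c(13, 9, 11)
)

no_year   <- select(toy, -Year)                  # drop one column
no_delays <- select(toy, -ends_with("Delay"))    # drop columns ending in "Delay"
taxi      <- select(toy, starts_with("Taxi"))    # keep columns starting with "Taxi"
by_regex  <- select(toy, matches("^(Dep|Arr)"))  # keep columns matching a regular expression
```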
Chaining/Pipelining
- The usual way to perform multiple operations in one line is by nesting.
- You can write commands in a natural order by using the %>% infix operator (which can be pronounced as "then").
- Chaining increases readability significantly when there are many commands.
- The operator is automatically imported from the magrittr package.
# nesting method to select UniqueCarrier and DepDelay columns and filter for delays over 60 minutes
filter(select(flights, UniqueCarrier, DepDelay), DepDelay > 60)
# chaining method
flights %>%
select(UniqueCarrier, DepDelay) %>%
filter(DepDelay > 60)
Another example
# create two vectors and calculate Euclidean distance between them
x1 <- 1:5; x2 <- 2:6
sqrt(sum((x1-x2)^2))
# chaining method
(x1-x2)^2 %>%
sum() %>%
sqrt()
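For comparison, R 4.1 and later also ship a native pipe, `|>`, that needs no package. This sketch repeats the distance calculation with it, assuming a recent enough R version; the extra parentheses are only there to make the grouping explicit.

```r
# same vectors as in the example above
x1 <- 1:5; x2 <- 2:6

# native pipe (R >= 4.1): the left-hand side becomes the first argument
d_native <- ((x1 - x2)^2) |> sum() |> sqrt()
d_native
```

For simple chains like this one the two pipes behave the same; the notes keep using %>% because it comes with dplyr.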
rename(): rename variables
rename(tbl, newname = oldname, ...)
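A minimal sketch of the `newname = oldname` pattern, using a made-up two-column table; the new names `carrier` and `dep_delay` are chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data with hflights-style column names
toy <- tibble(UniqueCarrier = c("AA", "UA"), DepDelay = c(5, 70))

# new name on the left, existing name on the right
renamed <- rename(toy, carrier = UniqueCarrier, dep_delay = DepDelay)
names(renamed)
```

Columns that are not mentioned keep their names and positions.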
arrange(): Reorder rows (sort observations)
arrange(df, one or more variables to sort by); desc() sorts in descending order.
# base R approach to select UniqueCarrier and DepDelay columns and sort by DepDelay
#flights[order(flights$DepDelay), c("UniqueCarrier", "DepDelay")]
# dplyr approach
flights %>%
select(UniqueCarrier, DepDelay) %>%
arrange(DepDelay)
# use `desc` for descending
# flights %>%
# select(UniqueCarrier, DepDelay) %>%
# arrange(desc(DepDelay))
#
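When several variables are given, arrange() sorts by the first and uses the later ones to break ties. A small sketch on invented data, mixing an ascending key with a descending one:

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  carrier   = c("UA", "AA", "AA", "UA"),
  dep_delay = c(10, 5, 70, 30)
)

# sort by carrier (A to Z), then by longest delay first within each carrier
sorted <- arrange(toy, carrier, desc(dep_delay))
sorted
```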
mutate(): Add new variables (appended after the existing ones); similar to transform()
# base R approach to create a new variable Speed (in mph)
#flights$Speed <- flights$Distance / flights$AirTime*60
#flights[, c("Distance", "AirTime", "Speed")]
# dplyr approach (prints the new variable but does not store it)
flights %>%
select(Distance, AirTime) %>%
mutate(Speed = Distance/AirTime*60)
# store the new variable
flights <- flights %>%
mutate(Speed = Distance/AirTime*60)
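One convenience of mutate() over transform() is that a new variable can be used right away by later expressions in the same call. A sketch with made-up distances (miles) and air times (minutes); the intermediate column `Hours` is an assumption added for illustration.

```r
library(dplyr)

# Hypothetical toy data: Distance in miles, AirTime in minutes
toy <- tibble(Distance = c(224, 1000), AirTime = c(40, 120))

out <- mutate(toy,
  Hours = AirTime / 60,      # created first
  Speed = Distance / Hours   # uses Hours from the line above
)
out
```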
summarise(): Reduce variables to values
summarise() uses the provided aggregation function to summarise each group. It becomes very powerful when combined with group_by()!
# base R approaches to calculate the average arrival delay to each destination
# head(with(
# flights,
# tapply(ArrDelay, Dest, mean, na.rm=TRUE))
# )
# head(aggregate(ArrDelay ~ Dest, flights, mean))
# dplyr approach: create a table grouped by Dest, and then summarise each group by taking the mean of ArrDelay
flights %>%
group_by(Dest) %>%
summarise(avg_delay = mean(ArrDelay, na.rm=TRUE))
summarise_each() applies the same summary function to multiple columns at once (mutate_each() is the mutate counterpart). Both are deprecated in current dplyr, so the examples here use across() inside summarise() instead.
# for each carrier, calculate the proportion of flights cancelled or diverted
flights %>%
group_by(UniqueCarrier) %>%
summarise(across(c(Cancelled, Diverted), mean))
# for each carrier, calculate the minimum and maximum arrival and departure delays
flights %>%
group_by(UniqueCarrier) %>%
summarise(across(matches("Delay"), list(min = ~ min(.x, na.rm=TRUE), max = ~ max(.x, na.rm=TRUE))))
n() counts the number of rows in a group
n_distinct(vector) counts the number of unique items in that vector
# for each day of the year, count the total number of flights and sort in descending order
flights %>%
group_by(Month, DayofMonth) %>%
summarise(flight_count = n()) %>%
arrange(desc(flight_count))
# for each destination, count the total number of flights and the number of distinct planes that flew there
flights %>%
group_by(Dest) %>%
summarise(flight_count = n(), plane_count = n_distinct(TailNum))
# for each destination, show the number of cancelled and not cancelled flights
flights %>%
group_by(Dest) %>%
select(Cancelled) %>%
table() %>%
head()
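The group_by() + summarise(n()) pattern is common enough that dplyr provides count() as a shortcut. The sketch below uses invented data to show the two forms side by side; it also gives a data-frame alternative to the table() call above.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  Dest      = c("DFW", "DFW", "AUS", "AUS", "AUS"),
  Cancelled = c(0, 1, 0, 0, 1)
)

# long form: group, then count rows per group
long_form <- toy %>%
  group_by(Dest, Cancelled) %>%
  summarise(n = n(), .groups = "drop")

# short form: count() groups and counts in one step, naming the result column n
short_form <- count(toy, Dest, Cancelled)
short_form
```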
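Window functions take n inputs and return n values. They include ranking and ordering functions (like min_rank()), offset functions (lead() and lag()), and cumulative aggregates (like cummean()).
ggplot2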
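Plotting is the Swiss Army knife of EDA. Before applying any technique, it is worth stopping to think first.
Start by taking a look at these plots to avoid.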
plot() - qplot() - ggplot() >> interactive plots: rCharts, plotly, networkD3, dygraphs… (depending on what your application needs). Some packages are enough on their own to cover your needs.
#install.packages('gtrendsR')
library(gtrendsR)
trends <- gtrends(c("Nerds", "Smarties"), geo ="CA")
plot(trends)
#install.packages('plotly')
library(plotly)
p <-plot(trends)
ggplotly(p)
The basic grammar of ggplot2
gg stands for grammar of graphics
(data, aesthetics) + geometry
data: a data frame
aesthetics: used to indicate x and y variables, also used to control the color, size, shape of points, heights of bars, etc.
geometry: corresponds to the type of graphics (histogram, box plot,…)
library(ggplot2)
gg <- ggplot(diamonds, aes(price, carat)) +
geom_point(color = "brown4") # scatter plot; size=1.5, shape=18
gg
gg <- gg + ggtitle("Diamond carat and price")
diamonds %>% # Start with the 'diamonds' dataset
filter(cut == "Ideal") %>% # Then, filter down to rows where cut == Ideal
ggplot(aes(x=color,y=price)) + # Then, plot using ggplot
geom_boxplot() # and create a boxplot
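To make the (data, aesthetics) + geometry grammar concrete, here is a sketch on the `mpg` data set that ships with ggplot2 (an assumption; the notes above use `diamonds`). The same data is drawn twice with different aesthetics and geometries.

```r
library(ggplot2)

# data: mpg; aesthetics: displ on x, hwy on y, class as colour; geometry: points
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()

# same data, different aesthetics and geometry: one box per vehicle class
b <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

p
b
```

Mapping `class` to colour inside aes() lets ggplot2 build the legend automatically, whereas a fixed colour such as "brown4" goes outside aes(), as in the diamonds example.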
Back to playing with gtrendsR
library(gtrendsR)
# https://rdrr.io/cran/gtrendsR/man/gtrends.html
#define the keywords
keywords=c("Paris","New York","Barcelona")
#set the geographic area: TW = Taiwan
country=c('TW')
#set the time window
time=("2010-01-01 2018-08-27")
#set channels
channel='web'
trends = gtrends(keywords, gprop =channel,geo=country, time = time )
#select only interest over time
time_trend=trends$interest_over_time
head(time_trend)
# store the plot under a name that does not shadow base R's plot()
trend_plot <- ggplot(data=time_trend, aes(x=date, y=hits, group=keyword, col=keyword))+
geom_line()+xlab('Time')+ylab('Relative Interest')+ theme_bw()+
theme(legend.title = element_blank(),legend.position="bottom",legend.text=element_text(size=12))+ggtitle("Google Search Volume")
trend_plot
Chinese-language references
http://molecular-service-science.com/2013/11/27/r-ggplot-tutorial-1/
http://molecular-service-science.com/2014/01/23/r-ggplot-tutorial-2/