library(tidyverse)
churn<-readr::read_csv("Databel_Data.csv")
churn |>
head(10) |>
flextable::flextable()
Introduction to Customer Churning
Comparing R and Powerbi DAX
- this blog aims to compare how things are done in R and Powerbi
Data Analysis Expressions (DAX) is a formula expression language used in Analysis Services, Power BI, and Power Pivot in Excel. DAX formulas include functions, operators, and values to perform advanced calculations and queries on data in related tables and columns in tabular data models.
- This article provides only a basic introduction to the most important concepts in DAX while comparing them to R
Data exploration
- before any analysis , it is mandatory to learn more about the rows in our data
- things like the
number of rows in our data
andduplicate columns
in our data are very important to understand
churn |>
summarise(`number of customers` = n(),
`distinct customers` = n_distinct(`Customer ID`))
# A tibble: 1 × 2
`number of customers` `distinct customers`
<int> <int>
1 6687 6687
OR
(`Number of Customers` <- count(churn))
# A tibble: 1 × 1
n
<int>
1 6687
- in powerbi our dataframe is
Databel - Data
and to index our data we use'Databel - Data'[variable]
to find the number of rows we use
COUNT()
orCOUNTROWS()
= COUNT('Databel - Data'[Customer ID]) Number of Customers
to find the number of UNIQUE customers we use
DISTINCTCOUNT()
orCOUNTROWS()
= DISTINCTCOUNT('Databel - Data'[Customer ID]) Number of Customers
Data manipulation
IF STATEMENTS
in dplyr we use
ifelse()
in conjuction withmutate()
to create a new variable based on a certain condition or criteria
# A tibble: 6,687 × 30
churned `Customer ID` `Churn Label` Account Length (in months…¹ `Local Calls`
<dbl> <chr> <chr> <dbl> <dbl>
1 0 4444-BZPU No 1 3
2 0 5676-PTZX No 33 179
3 0 8532-ZEKQ No 44 82
4 0 1314-SMPJ No 10 47
5 0 2956-TXCJ No 62 184
6 0 9152-DEPY No 17 68
7 0 1958-SDSO No 57 428
8 0 8787-QZUC No 25 54
9 0 7768-OQJE No 70 171
10 0 7716-RHEB No 50 206
# ℹ 6,677 more rows
# ℹ abbreviated name: ¹`Account Length (in months)`
# ℹ 25 more variables: `Local Mins` <dbl>, `Intl Calls` <dbl>,
# `Intl Mins` <dbl>, `Intl Active` <chr>, `Intl Plan` <chr>,
# `Extra International Charges` <dbl>, `Customer Service Calls` <dbl>,
# `Avg Monthly GB Download` <dbl>, `Unlimited Data Plan` <chr>,
# `Extra Data Charges` <dbl>, State <chr>, `Phone Number` <chr>, …
in
DAX
we useIF()
, uses the same idea asR
= IF('Databel - Data'[Churn Label]="Yes",1,0) Churned
Calculate churn Rate
for that we first calculate the number of
customers that churned
,
- in R we can sum over the
churned
column since we know it contains Zeros and ones
(`Number of Churned Customers` = sum(churn$churned))
[1] 1796
- about
1796
people churned
do it in
DAX
, we usesum() or SUM()
just like R
The only difference is we used $
in R
= SUM('Databel - Data'[Churned]) Number of Churned Customers
(`Churn Rate` = `Number of Churned Customers`/`Number of Customers`$n)* 100
[1] 26.85808
Churn Rate in PowerBi , just like in R we use
/
to imply divide
= [Number of Churned Customers]/[Number of Customers] Churn Rate
overally we get
26.86%
churn rate
Lets determine the frequent reasaons for churning
in R we use ggplot2 package to plot our data
churn |>
na.omit() |>
group_by(`Churn Reason`) |>
summarise(n = n()) |>
ggplot(aes(x= fct_reorder(`Churn Reason`,n) , y= n))+
geom_col(fill="blue")+
coord_flip()+
ggthemes::theme_economist()+
labs(x="Churn Reason",y="number of customers")
in Powerbi we simply selected a
clustered bar chart
More advanced data manipulation
table(churn$`Contract Type`)
Month-to-Month One Year Two Year
3411 1479 1797
suppose we intend to have just
year
andMonthly
contract types , we can choose to bundle upOne year
andTwo Year
into justYear
category.
(churn<-churn |>
mutate(`Contract Category` = case_when(`Contract Type` =="One Year"~ "Yearly",
`Contract Type` =="Two Year"~ "Yearly",
`Contract Type` =="Month-to-Month"~ "Monthly")) |>
relocate(`Contract Category`))
# A tibble: 6,687 × 31
`Contract Category` churned `Customer ID` `Churn Label`
<chr> <dbl> <chr> <chr>
1 Monthly 0 4444-BZPU No
2 Yearly 0 5676-PTZX No
3 Yearly 0 8532-ZEKQ No
4 Monthly 0 1314-SMPJ No
5 Yearly 0 2956-TXCJ No
6 Yearly 0 9152-DEPY No
7 Yearly 0 1958-SDSO No
8 Monthly 0 8787-QZUC No
9 Yearly 0 7768-OQJE No
10 Yearly 0 7716-RHEB No
# ℹ 6,677 more rows
# ℹ 27 more variables: `Account Length (in months)` <dbl>, `Local Calls` <dbl>,
# `Local Mins` <dbl>, `Intl Calls` <dbl>, `Intl Mins` <dbl>,
# `Intl Active` <chr>, `Intl Plan` <chr>,
# `Extra International Charges` <dbl>, `Customer Service Calls` <dbl>,
# `Avg Monthly GB Download` <dbl>, `Unlimited Data Plan` <chr>,
# `Extra Data Charges` <dbl>, State <chr>, `Phone Number` <chr>, …
SWITCH(<expression>, <value>, <result>[, <value>, <result>]…[, <else>])
= SWITCH('Databel - Data'[Contract Type],
Contract Category "One Year", "Yearly",
"Two Year", "Yearly",
"Monthly")
Churn Rates By state
In R we use the power of
mutate
andsprintf()
churn |>
group_by(State) |>
summarise(`# of customers` = n(),
`# of churned customers` = sum(churned),
`Churn Rate` = sprintf("%.2f%%",`# of churned customers`/`# of customers`* 100)) |>
knitr::kable()
State | # of customers | # of churned customers | Churn Rate |
---|---|---|---|
AK | 104 | 30 | 28.85% |
AL | 161 | 46 | 28.57% |
AR | 110 | 27 | 24.55% |
AZ | 128 | 30 | 23.44% |
CA | 68 | 43 | 63.24% |
CO | 132 | 29 | 21.97% |
CT | 148 | 38 | 25.68% |
DC | 108 | 21 | 19.44% |
DE | 122 | 37 | 30.33% |
FL | 127 | 27 | 21.26% |
GA | 108 | 25 | 23.15% |
HI | 106 | 28 | 26.42% |
IA | 90 | 24 | 26.67% |
ID | 147 | 41 | 27.89% |
IL | 116 | 33 | 28.45% |
IN | 143 | 42 | 29.37% |
KS | 141 | 36 | 25.53% |
KY | 118 | 36 | 30.51% |
LA | 103 | 25 | 24.27% |
MA | 131 | 33 | 25.19% |
MD | 140 | 46 | 32.86% |
ME | 124 | 29 | 23.39% |
MI | 146 | 40 | 27.40% |
MN | 168 | 38 | 22.62% |
MO | 127 | 36 | 28.35% |
MS | 130 | 37 | 28.46% |
MT | 137 | 43 | 31.39% |
NC | 136 | 28 | 20.59% |
ND | 124 | 31 | 25.00% |
NE | 122 | 40 | 32.79% |
NH | 112 | 36 | 32.14% |
NJ | 137 | 37 | 27.01% |
NM | 124 | 30 | 24.19% |
NV | 132 | 36 | 27.27% |
NY | 167 | 39 | 23.35% |
OH | 158 | 55 | 34.81% |
OK | 123 | 24 | 19.51% |
OR | 156 | 48 | 30.77% |
PA | 90 | 30 | 33.33% |
RI | 131 | 33 | 25.19% |
SC | 120 | 36 | 30.00% |
SD | 120 | 27 | 22.50% |
TN | 106 | 24 | 22.64% |
TX | 145 | 42 | 28.97% |
UT | 144 | 36 | 25.00% |
VA | 155 | 42 | 27.10% |
VT | 147 | 39 | 26.53% |
WA | 132 | 29 | 21.97% |
WI | 156 | 39 | 25.00% |
WV | 213 | 57 | 26.76% |
WY | 154 | 38 | 24.68% |
in powerbi we use a
table
to summarise our data just by clicking
Filtering
- filtering is a common pocedure in data science
here is how we do it in R as compared to DAX
- In R we use
filter() then summarise()
to get the mean for only a certain category
CALCULATE(<expression>[, <filter1> [, <filter2> [, …]]])
in direct debit = CALCULATE(AVERAGE('Databel - Data'[Monthly Charge]),
avg total charges 'Databel - Data'[Payment Method]="Direct Debit")
churn |>
filter(`Payment Method` == "Direct Debit") |>
summarise(`count customers in direct Debit` = n_distinct(`Customer ID`))
# A tibble: 1 × 1
`count customers in direct Debit`
<int>
1 3702
in direct debit = CALCULATE(DISTINCTCOUNT('Databel - Data'[CUSTOMER ID]),
avg total charges 'Databel - Data'[Payment Method]="Direct Debit")