Warning: package 'ggplot2' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
[1] "M" "F" "F" "F" "M" "M" "F"
gender_factor <- factor ( gender ,levels= c ( "M" ,"F" ) )
gender_factor
[1] M F F F M M F
Levels: M F
Generalised linear models
recap of linear regresion assumptions
the response variable is normality
set.seed ( 123 )
interest <- rnorm ( 20 ,175 ,20 )
status <- c ( 0 ,0 ,1 ,0 ,1 ,1 ,1 ,1 ,0 ,0 ,0 ,1 ,0 ,1 ,0 ,1 ,0 ,0 ,1 ,0 )
Call:
lm(formula = status ~ interest)
Residuals:
Min 1Q Median 3Q Max
-0.7079 -0.3229 -0.1001 0.4309 0.8852
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.66863 0.98543 -1.693 0.1076
interest 0.01191 0.00551 2.162 0.0443 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4672 on 18 degrees of freedom
Multiple R-squared: 0.2062, Adjusted R-squared: 0.1621
F-statistic: 4.675 on 1 and 18 DF, p-value: 0.04432
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
first transformation
second transformation
which simplies to
status_new <- 1 / ( 1 + exp ( - ( - 1.66863 + 0.01191 * ( interest ) ) ) )
status_new
[1] 0.5700646 0.5892321 0.6871662 0.6064502 0.6097868 0.6951162 0.6284120
[8] 0.5285395 0.5626716 0.5767543 0.6697813 0.6227713 0.6250605 0.6087318
[15] 0.5703351 0.6987311 0.6304640 0.4867960 0.6416845 0.5751760
model_fit <- glm ( status ~ interest ,data= df ,family = binomial ( link= "logit" ) )
summary ( model_fit )
Call:
glm(formula = status ~ interest, family = binomial(link = "logit"),
data = df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.97940 5.93445 -1.850 0.0643 .
interest 0.06032 0.03302 1.827 0.0677 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 27.526 on 19 degrees of freedom
Residual deviance: 22.930 on 18 degrees of freedom
AIC: 26.93
Number of Fisher Scoring iterations: 4
options ( scipen = 999 )
model_fit <- glm ( status ~ interest ,data= df ,family = binomial ( link= "logit" ) )
tidy ( model_fit ,exponentiate = TRUE )
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0000170 5.93 -1.85 0.0643
2 interest 1.06 0.0330 1.83 0.0677
library ( tidyverse )
cust <- readr :: read_csv ( "customer_data.csv" ) |>
select ( gender ,tenure ,SeniorCitizen ,InternetService ,Contract ,PaymentMethod ,MonthlyCharges ,TotalCharges ,Churn )
Rows: 7043 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
dbl (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 7,043 × 9
gender tenure SeniorCitizen InternetService Contract PaymentMethod
<chr> <dbl> <dbl> <chr> <chr> <chr>
1 Female 1 0 DSL Month-to-month Electronic check
2 Male 34 0 DSL One year Mailed check
3 Male 2 0 DSL Month-to-month Mailed check
4 Male 45 0 DSL One year Bank transfer (au…
5 Female 2 0 Fiber optic Month-to-month Electronic check
6 Female 8 0 Fiber optic Month-to-month Electronic check
7 Male 22 0 Fiber optic Month-to-month Credit card (auto…
8 Female 10 0 DSL Month-to-month Mailed check
9 Female 28 0 Fiber optic Month-to-month Electronic check
10 Male 62 0 DSL One year Bank transfer (au…
# ℹ 7,033 more rows
# ℹ 3 more variables: MonthlyCharges <dbl>, TotalCharges <dbl>, Churn <chr>
clean_data <- cust |>
mutate_if ( is.character ,as.factor )
clean_data
# A tibble: 7,043 × 9
gender tenure SeniorCitizen InternetService Contract PaymentMethod
<fct> <dbl> <dbl> <fct> <fct> <fct>
1 Female 1 0 DSL Month-to-month Electronic check
2 Male 34 0 DSL One year Mailed check
3 Male 2 0 DSL Month-to-month Mailed check
4 Male 45 0 DSL One year Bank transfer (au…
5 Female 2 0 Fiber optic Month-to-month Electronic check
6 Female 8 0 Fiber optic Month-to-month Electronic check
7 Male 22 0 Fiber optic Month-to-month Credit card (auto…
8 Female 10 0 DSL Month-to-month Mailed check
9 Female 28 0 Fiber optic Month-to-month Electronic check
10 Male 62 0 DSL One year Bank transfer (au…
# ℹ 7,033 more rows
# ℹ 3 more variables: MonthlyCharges <dbl>, TotalCharges <dbl>, Churn <fct>
conditional statements
the function ifelse() is a conditional statement used to create new varables
the function takes the following arguments
ifelse(condition,value when true, otherwise)
clean_data $ status <- ifelse ( clean_data $ Churn == "Yes" ,1 ,0 )
clean_data
# A tibble: 7,043 × 10
gender tenure SeniorCitizen InternetService Contract PaymentMethod
<fct> <dbl> <dbl> <fct> <fct> <fct>
1 Female 1 0 DSL Month-to-month Electronic check
2 Male 34 0 DSL One year Mailed check
3 Male 2 0 DSL Month-to-month Mailed check
4 Male 45 0 DSL One year Bank transfer (au…
5 Female 2 0 Fiber optic Month-to-month Electronic check
6 Female 8 0 Fiber optic Month-to-month Electronic check
7 Male 22 0 Fiber optic Month-to-month Credit card (auto…
8 Female 10 0 DSL Month-to-month Mailed check
9 Female 28 0 Fiber optic Month-to-month Electronic check
10 Male 62 0 DSL One year Bank transfer (au…
# ℹ 7,033 more rows
# ℹ 4 more variables: MonthlyCharges <dbl>, TotalCharges <dbl>, Churn <fct>,
# status <dbl>
Model building
null_model <- glm ( status ~ 1 ,data= clean_data ,family= binomial )
summary ( null_model )
Call:
glm(formula = status ~ 1, family = binomial, data = clean_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.01824 0.02699 -37.73 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8150.1 on 7042 degrees of freedom
Residual deviance: 8150.1 on 7042 degrees of freedom
AIC: 8152.1
Number of Fisher Scoring iterations: 4
full_model <- glm ( status ~ . - Churn ,data= clean_data ,family= binomial )
summary ( full_model )
Call:
glm(formula = status ~ . - Churn, family = binomial, data = clean_data)
Coefficients:
Estimate Std. Error z value
(Intercept) -0.0777537 0.2172993 -0.358
genderMale -0.0182879 0.0638132 -0.287
tenure -0.0601592 0.0062346 -9.649
SeniorCitizen 0.3376849 0.0812063 4.158
InternetServiceFiber optic 0.9632804 0.1306020 7.376
InternetServiceNo -0.9545950 0.1573155 -6.068
ContractOne year -0.7616137 0.1049811 -7.255
ContractTwo year -1.5760053 0.1735733 -9.080
PaymentMethodCredit card (automatic) -0.0718050 0.1122814 -0.640
PaymentMethodElectronic check 0.4147766 0.0926149 4.479
PaymentMethodMailed check -0.1110753 0.1127085 -0.986
MonthlyCharges -0.0045429 0.0034972 -1.299
TotalCharges 0.0003421 0.0000706 4.845
Pr(>|z|)
(Intercept) 0.720
genderMale 0.774
tenure < 0.0000000000000002 ***
SeniorCitizen 0.000032054208236 ***
InternetServiceFiber optic 0.000000000000163 ***
InternetServiceNo 0.000000001294893 ***
ContractOne year 0.000000000000402 ***
ContractTwo year < 0.0000000000000002 ***
PaymentMethodCredit card (automatic) 0.522
PaymentMethodElectronic check 0.000007516726374 ***
PaymentMethodMailed check 0.324
MonthlyCharges 0.194
TotalCharges 0.000001266940554 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8143.4 on 7031 degrees of freedom
Residual deviance: 5966.0 on 7019 degrees of freedom
(11 observations deleted due to missingness)
AIC: 5992
Number of Fisher Scoring iterations: 6
tenure_model <- glm ( status ~ tenure ,data= clean_data ,family= binomial )
summary ( tenure_model )
Call:
glm(formula = status ~ tenure, family = binomial, data = clean_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.027739 0.042225 0.657 0.511
tenure -0.038783 0.001406 -27.593 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8150.1 on 7042 degrees of freedom
Residual deviance: 7191.4 on 7041 degrees of freedom
AIC: 7195.4
Number of Fisher Scoring iterations: 4
anova ( null_model ,tenure_model ,test = "LRT" )
Analysis of Deviance Table
Model 1: status ~ 1
Model 2: status ~ tenure
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 7042 8150.1
2 7041 7191.4 1 958.71 < 0.00000000000000022 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3 <- glm ( status ~ tenure + Contract ,data= clean_data ,family= binomial )
summary ( model3 )
Call:
glm(formula = status ~ tenure + Contract, family = binomial,
data = clean_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.0007934 0.0437942 -0.018 0.986
tenure -0.0166984 0.0017125 -9.751 <0.0000000000000002 ***
ContractOne year -1.3999556 0.0956915 -14.630 <0.0000000000000002 ***
ContractTwo year -2.6359020 0.1609943 -16.373 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8150.1 on 7042 degrees of freedom
Residual deviance: 6665.1 on 7039 degrees of freedom
AIC: 6673.1
Number of Fisher Scoring iterations: 6
anova ( tenure_model ,model3 ,test = "LRT" )
Analysis of Deviance Table
Model 1: status ~ tenure
Model 2: status ~ tenure + Contract
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 7041 7191.4
2 7039 6665.1 2 526.33 < 0.00000000000000022 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4 <- glm ( status ~ tenure + Contract + gender ,data= clean_data ,family= binomial )
summary ( model4 )
Call:
glm(formula = status ~ tenure + Contract + gender, family = binomial,
data = clean_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.020373 0.053091 0.384 0.701
tenure -0.016699 0.001713 -9.751 <0.0000000000000002 ***
ContractOne year -1.399564 0.095699 -14.625 <0.0000000000000002 ***
ContractTwo year -2.636360 0.160991 -16.376 <0.0000000000000002 ***
genderMale -0.042110 0.059718 -0.705 0.481
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8150.1 on 7042 degrees of freedom
Residual deviance: 6664.6 on 7038 degrees of freedom
AIC: 6674.6
Number of Fisher Scoring iterations: 6
anova ( model3 ,model4 ,test = "LRT" )
Analysis of Deviance Table
Model 1: status ~ tenure + Contract
Model 2: status ~ tenure + Contract + gender
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 7039 6665.1
2 7038 6664.6 1 0.49725 0.4807
model5 <- glm ( status ~ tenure + Contract + MonthlyCharges ,data= clean_data ,family= binomial )
summary ( model5 )
Call:
glm(formula = status ~ tenure + Contract + MonthlyCharges, family = binomial,
data = clean_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.602659 0.089672 -17.87 <0.0000000000000002 ***
tenure -0.035739 0.002041 -17.51 <0.0000000000000002 ***
ContractOne year -1.062224 0.100542 -10.56 <0.0000000000000002 ***
ContractTwo year -2.021601 0.166171 -12.17 <0.0000000000000002 ***
MonthlyCharges 0.028619 0.001349 21.21 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8150.1 on 7042 degrees of freedom
Residual deviance: 6145.8 on 7038 degrees of freedom
AIC: 6155.8
Number of Fisher Scoring iterations: 6
Analysis of Deviance Table
Model 1: status ~ tenure + Contract + gender
Model 2: status ~ tenure + Contract + MonthlyCharges
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 7038 6664.6
2 7038 6145.8 0 518.82
model6 <- glm ( status ~ tenure + Contract + MonthlyCharges + PaymentMethod ,data= clean_data ,family= binomial )
summary ( model6 )
Call:
glm(formula = status ~ tenure + Contract + MonthlyCharges + PaymentMethod,
family = binomial, data = clean_data)
Coefficients:
Estimate Std. Error z value
(Intercept) -1.592074 0.124121 -12.827
tenure -0.034540 0.002082 -16.586
ContractOne year -0.964429 0.101512 -9.501
ContractTwo year -1.858607 0.167613 -11.089
MonthlyCharges 0.025172 0.001426 17.651
PaymentMethodCredit card (automatic) -0.066647 0.111533 -0.598
PaymentMethodElectronic check 0.486909 0.091804 5.304
PaymentMethodMailed check -0.129660 0.109946 -1.179
Pr(>|z|)
(Intercept) < 0.0000000000000002 ***
tenure < 0.0000000000000002 ***
ContractOne year < 0.0000000000000002 ***
ContractTwo year < 0.0000000000000002 ***
MonthlyCharges < 0.0000000000000002 ***
PaymentMethodCredit card (automatic) 0.550
PaymentMethodElectronic check 0.000000113 ***
PaymentMethodMailed check 0.238
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8150.1 on 7042 degrees of freedom
Residual deviance: 6075.1 on 7035 degrees of freedom
AIC: 6091.1
Number of Fisher Scoring iterations: 6
Analysis of Deviance Table
Model 1: status ~ tenure + Contract + MonthlyCharges
Model 2: status ~ tenure + Contract + MonthlyCharges + PaymentMethod
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 7038 6145.8
2 7035 6075.1 3 70.718 0.000000000000002995 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model7 <- glm ( status ~ tenure + Contract + MonthlyCharges + PaymentMethod + InternetService ,data= clean_data ,family= binomial )
summary ( model7 )
Call:
glm(formula = status ~ tenure + Contract + MonthlyCharges + PaymentMethod +
InternetService, family = binomial, data = clean_data)
Coefficients:
Estimate Std. Error z value
(Intercept) -0.648118 0.175485 -3.693
tenure -0.031323 0.002110 -14.842
ContractOne year -0.801472 0.103679 -7.730
ContractTwo year -1.612038 0.170279 -9.467
MonthlyCharges 0.003748 0.003007 1.246
PaymentMethodCredit card (automatic) -0.066563 0.112079 -0.594
PaymentMethodElectronic check 0.446015 0.092411 4.826
PaymentMethodMailed check -0.070081 0.111345 -0.629
InternetServiceFiber optic 0.959376 0.127824 7.505
InternetServiceNo -0.803637 0.151449 -5.306
Pr(>|z|)
(Intercept) 0.000221 ***
tenure < 0.0000000000000002 ***
ContractOne year 0.0000000000000107 ***
ContractTwo year < 0.0000000000000002 ***
MonthlyCharges 0.212698
PaymentMethodCredit card (automatic) 0.552584
PaymentMethodElectronic check 0.0000013900527627 ***
PaymentMethodMailed check 0.529083
InternetServiceFiber optic 0.0000000000000612 ***
InternetServiceNo 0.0000001118522771 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 8150.1 on 7042 degrees of freedom
Residual deviance: 6010.4 on 7033 degrees of freedom
AIC: 6030.4
Number of Fisher Scoring iterations: 6
Analysis of Deviance Table
Model 1: status ~ tenure + Contract + MonthlyCharges + PaymentMethod
Model 2: status ~ tenure + Contract + MonthlyCharges + PaymentMethod +
InternetService
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 7035 6075.1
2 7033 6010.4 2 64.719 0.000000000000008838 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1