Untitled

Warning: package 'ggplot2' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(broom)
gender<-c("M","F","F","F","M","M","F")
gender
[1] "M" "F" "F" "F" "M" "M" "F"
class(gender)
[1] "character"
gender_factor<-factor(gender ,levels=c("M","F"))
gender_factor
[1] M F F F M M F
Levels: M F

Generalised linear models

recap of linear regresion assumptions

  • the response variable is normality
set.seed(123)
interest<-rnorm(20,175,20)
status<-c(0,0,1,0,1,1,1,1,0,0,0,1,0,1,0,1,0,0,1,0)
model<-lm(status~interest)
summary(model)

Call:
lm(formula = status ~ interest)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7079 -0.3229 -0.1001  0.4309  0.8852 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) -1.66863    0.98543  -1.693   0.1076  
interest     0.01191    0.00551   2.162   0.0443 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4672 on 18 degrees of freedom
Multiple R-squared:  0.2062,    Adjusted R-squared:  0.1621 
F-statistic: 4.675 on 1 and 18 DF,  p-value: 0.04432

status=β0+β1(interest)

-1.668+0.01*270
[1] 1.032
df<-cbind(interest,status) |> 
  as.data.frame()

ggplot(df,aes(x=interest,y=status))+
  geom_point()+
  geom_smooth(method="lm")
`geom_smooth()` using formula = 'y ~ x'

ggplot(df,aes(x=interest,y=status))+
  geom_point()+
  geom_smooth(method="glm",method.args=list(family=binomial))
`geom_smooth()` using formula = 'y ~ x'

y=a+bx

first transformation

y=ea+bx

second transformation

y=ea+bx1+ea+bx

which simplies to

y=11+e(a+bx)

y(1+e(a+bx))=1

y+y(e(a+bx))=1

y(e(a+bx))=1y

(e(a+bx))=1yy

(a+bx)=ln1yy

a+bx=ln1yy

a+bx=lny1y

a+bx=logit(y)

status_new<-1/(1+exp(-(-1.66863+0.01191*(interest))))
status_new
 [1] 0.5700646 0.5892321 0.6871662 0.6064502 0.6097868 0.6951162 0.6284120
 [8] 0.5285395 0.5626716 0.5767543 0.6697813 0.6227713 0.6250605 0.6087318
[15] 0.5703351 0.6987311 0.6304640 0.4867960 0.6416845 0.5751760
model_fit<-glm(status~interest,data=df,family = binomial(link="logit"))
summary(model_fit)

Call:
glm(formula = status ~ interest, family = binomial(link = "logit"), 
    data = df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept) -10.97940    5.93445  -1.850   0.0643 .
interest      0.06032    0.03302   1.827   0.0677 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27.526  on 19  degrees of freedom
Residual deviance: 22.930  on 18  degrees of freedom
AIC: 26.93

Number of Fisher Scoring iterations: 4

a+bx=lnp1p

a+bx=ln(odds)

ea+bx=odds

options(scipen = 999)
model_fit<-glm(status~interest,data=df,family = binomial(link="logit"))
tidy(model_fit,exponentiate = TRUE)
# A tibble: 2 × 5
  term         estimate std.error statistic p.value
  <chr>           <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept) 0.0000170    5.93       -1.85  0.0643
2 interest    1.06         0.0330      1.83  0.0677
library(tidyverse)
cust<-readr::read_csv("customer_data.csv") |> 
  select(gender,tenure,SeniorCitizen,InternetService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn)
Rows: 7043 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
dbl  (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cust
# A tibble: 7,043 × 9
   gender tenure SeniorCitizen InternetService Contract       PaymentMethod     
   <chr>   <dbl>         <dbl> <chr>           <chr>          <chr>             
 1 Female      1             0 DSL             Month-to-month Electronic check  
 2 Male       34             0 DSL             One year       Mailed check      
 3 Male        2             0 DSL             Month-to-month Mailed check      
 4 Male       45             0 DSL             One year       Bank transfer (au…
 5 Female      2             0 Fiber optic     Month-to-month Electronic check  
 6 Female      8             0 Fiber optic     Month-to-month Electronic check  
 7 Male       22             0 Fiber optic     Month-to-month Credit card (auto…
 8 Female     10             0 DSL             Month-to-month Mailed check      
 9 Female     28             0 Fiber optic     Month-to-month Electronic check  
10 Male       62             0 DSL             One year       Bank transfer (au…
# ℹ 7,033 more rows
# ℹ 3 more variables: MonthlyCharges <dbl>, TotalCharges <dbl>, Churn <chr>
clean_data<-cust |> 
  mutate_if(is.character,as.factor)
clean_data
# A tibble: 7,043 × 9
   gender tenure SeniorCitizen InternetService Contract       PaymentMethod     
   <fct>   <dbl>         <dbl> <fct>           <fct>          <fct>             
 1 Female      1             0 DSL             Month-to-month Electronic check  
 2 Male       34             0 DSL             One year       Mailed check      
 3 Male        2             0 DSL             Month-to-month Mailed check      
 4 Male       45             0 DSL             One year       Bank transfer (au…
 5 Female      2             0 Fiber optic     Month-to-month Electronic check  
 6 Female      8             0 Fiber optic     Month-to-month Electronic check  
 7 Male       22             0 Fiber optic     Month-to-month Credit card (auto…
 8 Female     10             0 DSL             Month-to-month Mailed check      
 9 Female     28             0 Fiber optic     Month-to-month Electronic check  
10 Male       62             0 DSL             One year       Bank transfer (au…
# ℹ 7,033 more rows
# ℹ 3 more variables: MonthlyCharges <dbl>, TotalCharges <dbl>, Churn <fct>

conditional statements

  • the function ifelse() is a conditional statement used to create new varables
  • the function takes the following arguments

ifelse(condition,value when true, otherwise)

clean_data$status<-ifelse(clean_data$Churn=="Yes",1,0)
clean_data
# A tibble: 7,043 × 10
   gender tenure SeniorCitizen InternetService Contract       PaymentMethod     
   <fct>   <dbl>         <dbl> <fct>           <fct>          <fct>             
 1 Female      1             0 DSL             Month-to-month Electronic check  
 2 Male       34             0 DSL             One year       Mailed check      
 3 Male        2             0 DSL             Month-to-month Mailed check      
 4 Male       45             0 DSL             One year       Bank transfer (au…
 5 Female      2             0 Fiber optic     Month-to-month Electronic check  
 6 Female      8             0 Fiber optic     Month-to-month Electronic check  
 7 Male       22             0 Fiber optic     Month-to-month Credit card (auto…
 8 Female     10             0 DSL             Month-to-month Mailed check      
 9 Female     28             0 Fiber optic     Month-to-month Electronic check  
10 Male       62             0 DSL             One year       Bank transfer (au…
# ℹ 7,033 more rows
# ℹ 4 more variables: MonthlyCharges <dbl>, TotalCharges <dbl>, Churn <fct>,
#   status <dbl>
ggplot(clean_data,aes(x=Churn,fill=Contract))+
  geom_bar()

Model building

null_model<-glm(status~1,data=clean_data,family=binomial)
summary(null_model)

Call:
glm(formula = status ~ 1, family = binomial, data = clean_data)

Coefficients:
            Estimate Std. Error z value            Pr(>|z|)    
(Intercept) -1.01824    0.02699  -37.73 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8150.1  on 7042  degrees of freedom
Residual deviance: 8150.1  on 7042  degrees of freedom
AIC: 8152.1

Number of Fisher Scoring iterations: 4
AIC(null_model)
[1] 8152.146
full_model<-glm(status~.-Churn,data=clean_data,family=binomial)
summary(full_model)

Call:
glm(formula = status ~ . - Churn, family = binomial, data = clean_data)

Coefficients:
                                       Estimate Std. Error z value
(Intercept)                          -0.0777537  0.2172993  -0.358
genderMale                           -0.0182879  0.0638132  -0.287
tenure                               -0.0601592  0.0062346  -9.649
SeniorCitizen                         0.3376849  0.0812063   4.158
InternetServiceFiber optic            0.9632804  0.1306020   7.376
InternetServiceNo                    -0.9545950  0.1573155  -6.068
ContractOne year                     -0.7616137  0.1049811  -7.255
ContractTwo year                     -1.5760053  0.1735733  -9.080
PaymentMethodCredit card (automatic) -0.0718050  0.1122814  -0.640
PaymentMethodElectronic check         0.4147766  0.0926149   4.479
PaymentMethodMailed check            -0.1110753  0.1127085  -0.986
MonthlyCharges                       -0.0045429  0.0034972  -1.299
TotalCharges                          0.0003421  0.0000706   4.845
                                                 Pr(>|z|)    
(Intercept)                                         0.720    
genderMale                                          0.774    
tenure                               < 0.0000000000000002 ***
SeniorCitizen                           0.000032054208236 ***
InternetServiceFiber optic              0.000000000000163 ***
InternetServiceNo                       0.000000001294893 ***
ContractOne year                        0.000000000000402 ***
ContractTwo year                     < 0.0000000000000002 ***
PaymentMethodCredit card (automatic)                0.522    
PaymentMethodElectronic check           0.000007516726374 ***
PaymentMethodMailed check                           0.324    
MonthlyCharges                                      0.194    
TotalCharges                            0.000001266940554 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8143.4  on 7031  degrees of freedom
Residual deviance: 5966.0  on 7019  degrees of freedom
  (11 observations deleted due to missingness)
AIC: 5992

Number of Fisher Scoring iterations: 6
tenure_model<-glm(status~tenure,data=clean_data,family=binomial)
summary(tenure_model)

Call:
glm(formula = status ~ tenure, family = binomial, data = clean_data)

Coefficients:
             Estimate Std. Error z value            Pr(>|z|)    
(Intercept)  0.027739   0.042225   0.657               0.511    
tenure      -0.038783   0.001406 -27.593 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8150.1  on 7042  degrees of freedom
Residual deviance: 7191.4  on 7041  degrees of freedom
AIC: 7195.4

Number of Fisher Scoring iterations: 4
anova(null_model,tenure_model,test = "LRT")
Analysis of Deviance Table

Model 1: status ~ 1
Model 2: status ~ tenure
  Resid. Df Resid. Dev Df Deviance              Pr(>Chi)    
1      7042     8150.1                                      
2      7041     7191.4  1   958.71 < 0.00000000000000022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model3<-glm(status~tenure+Contract,data=clean_data,family=binomial)
summary(model3)

Call:
glm(formula = status ~ tenure + Contract, family = binomial, 
    data = clean_data)

Coefficients:
                   Estimate Std. Error z value            Pr(>|z|)    
(Intercept)      -0.0007934  0.0437942  -0.018               0.986    
tenure           -0.0166984  0.0017125  -9.751 <0.0000000000000002 ***
ContractOne year -1.3999556  0.0956915 -14.630 <0.0000000000000002 ***
ContractTwo year -2.6359020  0.1609943 -16.373 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8150.1  on 7042  degrees of freedom
Residual deviance: 6665.1  on 7039  degrees of freedom
AIC: 6673.1

Number of Fisher Scoring iterations: 6
anova(tenure_model,model3,test = "LRT")
Analysis of Deviance Table

Model 1: status ~ tenure
Model 2: status ~ tenure + Contract
  Resid. Df Resid. Dev Df Deviance              Pr(>Chi)    
1      7041     7191.4                                      
2      7039     6665.1  2   526.33 < 0.00000000000000022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4<-glm(status~tenure+Contract+gender,data=clean_data,family=binomial)
summary(model4)

Call:
glm(formula = status ~ tenure + Contract + gender, family = binomial, 
    data = clean_data)

Coefficients:
                  Estimate Std. Error z value            Pr(>|z|)    
(Intercept)       0.020373   0.053091   0.384               0.701    
tenure           -0.016699   0.001713  -9.751 <0.0000000000000002 ***
ContractOne year -1.399564   0.095699 -14.625 <0.0000000000000002 ***
ContractTwo year -2.636360   0.160991 -16.376 <0.0000000000000002 ***
genderMale       -0.042110   0.059718  -0.705               0.481    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8150.1  on 7042  degrees of freedom
Residual deviance: 6664.6  on 7038  degrees of freedom
AIC: 6674.6

Number of Fisher Scoring iterations: 6
anova(model3,model4,test = "LRT")
Analysis of Deviance Table

Model 1: status ~ tenure + Contract
Model 2: status ~ tenure + Contract + gender
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      7039     6665.1                     
2      7038     6664.6  1  0.49725   0.4807
model5<-glm(status~tenure+Contract+MonthlyCharges,data=clean_data,family=binomial)
summary(model5)

Call:
glm(formula = status ~ tenure + Contract + MonthlyCharges, family = binomial, 
    data = clean_data)

Coefficients:
                  Estimate Std. Error z value            Pr(>|z|)    
(Intercept)      -1.602659   0.089672  -17.87 <0.0000000000000002 ***
tenure           -0.035739   0.002041  -17.51 <0.0000000000000002 ***
ContractOne year -1.062224   0.100542  -10.56 <0.0000000000000002 ***
ContractTwo year -2.021601   0.166171  -12.17 <0.0000000000000002 ***
MonthlyCharges    0.028619   0.001349   21.21 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8150.1  on 7042  degrees of freedom
Residual deviance: 6145.8  on 7038  degrees of freedom
AIC: 6155.8

Number of Fisher Scoring iterations: 6
options(scipen=999)
anova(model4,model5,test = "LRT")
Analysis of Deviance Table

Model 1: status ~ tenure + Contract + gender
Model 2: status ~ tenure + Contract + MonthlyCharges
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      7038     6664.6                     
2      7038     6145.8  0   518.82         
model6<-glm(status~tenure+Contract+MonthlyCharges+PaymentMethod,data=clean_data,family=binomial)
summary(model6)

Call:
glm(formula = status ~ tenure + Contract + MonthlyCharges + PaymentMethod, 
    family = binomial, data = clean_data)

Coefficients:
                                      Estimate Std. Error z value
(Intercept)                          -1.592074   0.124121 -12.827
tenure                               -0.034540   0.002082 -16.586
ContractOne year                     -0.964429   0.101512  -9.501
ContractTwo year                     -1.858607   0.167613 -11.089
MonthlyCharges                        0.025172   0.001426  17.651
PaymentMethodCredit card (automatic) -0.066647   0.111533  -0.598
PaymentMethodElectronic check         0.486909   0.091804   5.304
PaymentMethodMailed check            -0.129660   0.109946  -1.179
                                                 Pr(>|z|)    
(Intercept)                          < 0.0000000000000002 ***
tenure                               < 0.0000000000000002 ***
ContractOne year                     < 0.0000000000000002 ***
ContractTwo year                     < 0.0000000000000002 ***
MonthlyCharges                       < 0.0000000000000002 ***
PaymentMethodCredit card (automatic)                0.550    
PaymentMethodElectronic check                 0.000000113 ***
PaymentMethodMailed check                           0.238    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8150.1  on 7042  degrees of freedom
Residual deviance: 6075.1  on 7035  degrees of freedom
AIC: 6091.1

Number of Fisher Scoring iterations: 6
options(scipen=999)
anova(model5,model6,test = "LRT")
Analysis of Deviance Table

Model 1: status ~ tenure + Contract + MonthlyCharges
Model 2: status ~ tenure + Contract + MonthlyCharges + PaymentMethod
  Resid. Df Resid. Dev Df Deviance             Pr(>Chi)    
1      7038     6145.8                                     
2      7035     6075.1  3   70.718 0.000000000000002995 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model7<-glm(status~tenure+Contract+MonthlyCharges+PaymentMethod+InternetService,data=clean_data,family=binomial)
summary(model7)

Call:
glm(formula = status ~ tenure + Contract + MonthlyCharges + PaymentMethod + 
    InternetService, family = binomial, data = clean_data)

Coefficients:
                                      Estimate Std. Error z value
(Intercept)                          -0.648118   0.175485  -3.693
tenure                               -0.031323   0.002110 -14.842
ContractOne year                     -0.801472   0.103679  -7.730
ContractTwo year                     -1.612038   0.170279  -9.467
MonthlyCharges                        0.003748   0.003007   1.246
PaymentMethodCredit card (automatic) -0.066563   0.112079  -0.594
PaymentMethodElectronic check         0.446015   0.092411   4.826
PaymentMethodMailed check            -0.070081   0.111345  -0.629
InternetServiceFiber optic            0.959376   0.127824   7.505
InternetServiceNo                    -0.803637   0.151449  -5.306
                                                 Pr(>|z|)    
(Intercept)                                      0.000221 ***
tenure                               < 0.0000000000000002 ***
ContractOne year                       0.0000000000000107 ***
ContractTwo year                     < 0.0000000000000002 ***
MonthlyCharges                                   0.212698    
PaymentMethodCredit card (automatic)             0.552584    
PaymentMethodElectronic check          0.0000013900527627 ***
PaymentMethodMailed check                        0.529083    
InternetServiceFiber optic             0.0000000000000612 ***
InternetServiceNo                      0.0000001118522771 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8150.1  on 7042  degrees of freedom
Residual deviance: 6010.4  on 7033  degrees of freedom
AIC: 6030.4

Number of Fisher Scoring iterations: 6
options(scipen=999)
anova(model6,model7,test = "LRT")
Analysis of Deviance Table

Model 1: status ~ tenure + Contract + MonthlyCharges + PaymentMethod
Model 2: status ~ tenure + Contract + MonthlyCharges + PaymentMethod + 
    InternetService
  Resid. Df Resid. Dev Df Deviance             Pr(>Chi)    
1      7035     6075.1                                     
2      7033     6010.4  2   64.719 0.000000000000008838 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1