Title: | Speed Up Exploratory Data Analysis (EDA) with the Correlation Funnel |
---|---|
Description: | Speeds up exploratory data analysis (EDA) by providing a succinct workflow and interactive visualization tools for understanding which features have relationships to target (response). Uses binary correlation analysis to determine relationship. Default correlation method is the Pearson method. Lian Duan, W Nick Street, Yanchi Liu, Songhua Xu, and Brook Wu (2014) <doi:10.1145/2637484>. |
Authors: | Matt Dancho [aut, cre] |
Maintainer: | Matt Dancho <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.0 |
Built: | 2024-11-11 04:31:06 UTC |
Source: | https://github.com/business-science/correlationfunnel |
binarize
returns the binary data converted from data in normal (numeric and categorical) format.
binarize( data, n_bins = 4, thresh_infreq = 0.01, name_infreq = "-OTHER", one_hot = TRUE )
data | A |
n_bins | The number of bins used to convert continuous (numeric) features into discrete features (bins). |
thresh_infreq | The threshold for converting categorical (character or factor) features into an "Other" category. |
name_infreq | The name for infrequently appearing categories to be lumped into. Set to "-OTHER" by default. |
one_hot | If set to |
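To make the thresh_infreq and name_infreq arguments concrete, here is a minimal base R sketch of lumping rare categories. It is not the package's implementation: lump_infrequent() is a hypothetical helper, the job vector is made-up data, and the strict "less than" comparison against the threshold is an assumption.

```r
# Hypothetical helper (not part of correlationfunnel): replace categories whose
# share of rows falls below `thresh_infreq` with a single catch-all label.
lump_infrequent <- function(x, thresh_infreq = 0.01, name_infreq = "-OTHER") {
  x <- as.character(x)
  # Proportion of rows taken by each category
  props <- table(x) / length(x)
  # Assumption: a category is "infrequent" when its share is strictly below the threshold
  rare <- names(props)[props < thresh_infreq]
  x[x %in% rare] <- name_infreq
  x
}

# Made-up example data: "student" covers 1% of rows
job <- c(rep("admin.", 60), rep("technician", 39), "student")
lumped <- lump_infrequent(job, thresh_infreq = 0.05)
table(lumped)
```

Lumping keeps rare levels from each producing their own near-empty binary column after one-hot encoding.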
The binned format helps correlation analysis identify non-linear trends between a predictor (binned values) and a response (the target).
The binarize() function takes data in a "normal" format and converts it to a binary format that is useful as a preparation step before using correlate():
Numeric Features: The "Normal Data" format has numeric features that are continuous values in numeric format (double or integer). The binarize() function converts these to bins (categories) and then discretizes the bins using a one-hot encoding process.
Categorical Features: The "Normal Data" format has categorical features that are in character or factor format. The binarize() function converts these to binary features using a one-hot encoding process.
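The two conversions described above can be sketched in base R. This is an illustration of the idea only, not how binarize() is implemented: the age and marital vectors are made-up data, and the use of quantile-based cut points for the bins is an assumption.

```r
# Made-up example data (not from the package)
age     <- c(23, 35, 41, 58, 62, 29, 47, 33)
marital <- factor(c("married", "single", "married", "divorced",
                    "married", "single", "divorced", "single"))

# Numeric feature: split into n_bins bins, then one-hot encode the bins.
# Assumption: bin edges are taken at equally spaced quantiles.
n_bins  <- 4
breaks  <- quantile(age, probs = seq(0, 1, length.out = n_bins + 1))
age_bin <- cut(age, breaks = breaks, include.lowest = TRUE)
age_onehot <- model.matrix(~ age_bin - 1)

# Categorical feature: one-hot encode the levels directly.
marital_onehot <- model.matrix(~ marital - 1)

# Every row has exactly one 1 per original feature
binary_data <- cbind(age_onehot, marital_onehot)
head(binary_data)
```

Because each bin becomes its own 0/1 column, a feature whose relationship to the target is strong only in the lowest or highest bin can still show up in a correlation analysis.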
A tbl
library(dplyr)
library(correlationfunnel)

marketing_campaign_tbl %>%
    select(-ID) %>%
    binarize()
correlate
returns a correlation between a target column and the features in a data set.
correlate(data, target, ...)
data | A |
target | The feature that contains the response (target) whose relationship to the other features you want to measure. |
... | Other arguments passed to cor |
The correlate() function provides a convenient wrapper around the cor function, where the target is the column containing the Y variable. The function is intended to be used with binarize(), which enables creation of the binary correlation analysis that feeds the plot_correlation_funnel() visualization.
The default method is the Pearson correlation, which is the Correlation Coefficient from L. Duan et al., 2014. This represents the linear relationship between two dichotomous features (binary variables). Learn more about the binary correlation approach in the Vignette covering the Methodology, Key Considerations and FAQs.
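As a small illustration of Pearson correlation on dichotomous features, the base R sketch below compares cor() with the phi coefficient computed from a 2x2 contingency table; for two 0/1 variables the two are equal. The feature and target vectors are made-up data, and the code does not call the package itself.

```r
# Made-up binary (0/1) feature and target
feature <- c(1, 1, 1, 0, 0, 0, 1, 0)
target  <- c(1, 1, 0, 0, 0, 1, 1, 0)

# Pearson correlation computed directly on the 0/1 values
pearson <- cor(feature, target, method = "pearson")

# Phi coefficient from the 2x2 table of counts
tab <- table(feature, target)
n11 <- as.numeric(tab["1", "1"])
n00 <- as.numeric(tab["0", "0"])
n10 <- as.numeric(tab["1", "0"])
n01 <- as.numeric(tab["0", "1"])
phi <- (n11 * n00 - n10 * n01) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))

# The two values agree for binary inputs
all.equal(pearson, phi)
```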
A tbl
Lian Duan, W. Nick Street, Yanchi Liu, Songhua Xu, and Brook Wu. 2014. Selecting the right correlation measure for binary data. ACM Trans. Knowl. Discov. Data 9, 2, Article 13 (September 2014), 28 pages. DOI: http://dx.doi.org/10.1145/2637484
binarize(), plot_correlation_funnel()
library(dplyr)
library(correlationfunnel)

marketing_campaign_tbl %>%
    select(-ID) %>%
    binarize() %>%
    correlate(TERM_DEPOSIT__yes)
A dataset containing data related to telecom customers that have enrolled in various products and services.
customer_churn_tbl
An object of class spec_tbl_df (inherits from tbl_df, tbl, data.frame) with 7043 rows and 21 columns.
customerID (chr): CUSTOMER ID
gender (chr): Customer's gender ("Female", "Male")
SeniorCitizen (dbl): 1 = Senior Citizen, 0 = Not Senior Citizen
Partner (chr): Whether the customer has a partner or not (Yes, No)
Dependents (chr): Whether the customer has dependents or not (Yes, No)
tenure (dbl): Number of months the customer has stayed with the company
PhoneService (chr): Whether the customer has a phone service or not (Yes, No)
MultipleLines (chr): Whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService (chr): Customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity (chr): Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup (chr): Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection (chr): Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport (chr): Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV (chr): Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies (chr): Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract (chr): The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling (chr): Whether the customer has paperless billing or not (Yes, No)
PaymentMethod (chr): The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges (dbl): The amount charged to the customer monthly
TotalCharges (dbl): The total amount charged to the customer
Churn (chr): Outcome. Whether the customer churned or not (Yes or No)
A dataset containing data related to bank clients, last contact of the current marketing campaign, and attributes related to a previous marketing campaign.
marketing_campaign_tbl
An object of class tbl_df (inherits from tbl, data.frame) with 45211 rows and 18 columns.
ID (chr): CUSTOMER ID
AGE (dbl): Customer's age
JOB (chr): Type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar","self-employed","retired","technician","services")
MARITAL (chr): marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
EDUCATION (chr): categorical: "unknown","secondary","primary","tertiary"
DEFAULT (chr): Has credit in default? (binary: "yes","no")
BALANCE (dbl): Average yearly balance, in euros (numeric)
HOUSING (chr): Has housing loan? (binary: "yes","no")
LOAN (chr): Has personal loan? (binary: "yes","no")
CONTACT (chr): Contact communication type (categorical: "unknown","telephone","cellular")
DAY (dbl): Last contact day of the month (numeric)
MONTH (chr): Last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
DURATION (dbl): Last contact duration, in seconds (numeric)
CAMPAIGN (dbl): Number of contacts performed during this campaign and for this client (numeric, includes last contact)
PDAYS (dbl): Number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
PREVIOUS (dbl): Number of contacts performed before this campaign and for this client (numeric)
POUTCOME (chr): Outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")
TERM_DEPOSIT (chr): Has the client subscribed a term deposit? (binary: "yes","no")
Moro et al., 2014 S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
plot_correlation_funnel
returns a correlation funnel visualization in either static (ggplot2) or interactive (plotly) formats.
plot_correlation_funnel( data, interactive = FALSE, limits = c(-1, 1), alpha = 1 )
data | A |
interactive | Returns either a static ( |
limits | Sets the X-Axis limits for the correlation space. |
alpha | Sets the transparency of the points on the plot. |
A static ggplot2 plot or an interactive plotly plot
library(dplyr)
library(correlationfunnel)

marketing_campaign_tbl %>%
    select(-ID) %>%
    binarize() %>%
    correlate(TERM_DEPOSIT__yes) %>%
    plot_correlation_funnel()