Loading data

filename tempurl TEMP;

proc http
   url= "https://s3.amazonaws.com/talent-assets.datacamp.com/product_sales.csv"
    method="get" out=tempurl;
run;


/* Import the data from the URL into a SAS dataset */
options errors=0; /* Suppress invalid data log entries whe REVENUE values are "NA" */
data work.sales;
    infile tempurl dlm =',' dsd firstobs=2 truncover ; /* Set delimiter and skip header row */
    /* Supplay informats on the input statement to set variable lengths */
    input week:best32.  sales_method:$12.  customer_id:$36.  nb_sold:best32. revenue:best32. 
    years_as_customer:best32.  nb_site_visits:best32.  state:$13.  ;
run;
options errors=20; /* Restore default setting */

Data summary


proc means data=mydata NMISS N;
run;

proc contents data = sales;
run;

‘proc means’ with NMISS N show number of missing in variables: revenue has 1074 NA.

‘proc contents’ creat output with details as names of 8 variales in which 4 char type and 4 numeric type.

proc univariate data = sales;
run;

‘proc univariate’ creat output with summary of 4 numeric variables in which ‘revenue’ has 1074 missing, 7.16%, as ‘.’

proc freq data=sales;
    tables week sales_method state / out=sales_char(keep=var:);
run;

‘proc freq’ creat output with summary of 4 character variables in which some incorrected values of ‘sales_method’ are shown as ‘em + cal’, ‘email’.

These incorrected values should be replaced with the corrected values in next steps.

Top 10 States of customers

What are top 10 states in which customers bought products for the company.


proc freq data=sales2;
    tables State / out=freq_table outpct noprint;
run;

proc sort data=freq_table;
    by descending Count;
run;

proc print data=freq_table (obs=10);
    var State Count Percent;
    title 'Top 10 States of Customers';
run;

Data cleaning

correcting ‘sales_method’

data sales2;
    set sales;
    if sales_method = 'email' then sales_method = 'Email';
    else if sales_method = 'em + call' then sales_method = 'Email + Call' ;
    else sales_method = sales_method;
run;

check if the incorrected are fixed

proc freq data=sales2;
    tables sales_method / nocum out = sales_m;
run;

The ‘sales_method’ is now correct with 3 values as ‘Email’, ‘Email+’ and ‘Call’.

replace missing values of ‘revenue’ with the mean = 93.93

data sales_m;
set sales2;
if revenue = '.' then revenue = 93.93;
run;

Check the output, see the dataset with ‘.’ is replaced with 93.93.

replace incorrected values of ‘year_as_customer’

Two rows have value of 47 and 63, these incorrected values are replaced with the mean of 5 years.

 data sales2;
 set sales2;
 if years_as_customer > 40 then years_as_customer = 5;
 run;

other way to write code together

data sales2;
    set sales;
    if sales_method = 'email' then sales_method = 'Email';
    else if sales_method = 'em + cal' then sales_method = 'Email + Call' ;
    else sales_method = sales_method;
    
    if years_as_customer > 40 then years_as_customer = 5;
  
    if revenue = '.' then revenue = 93.93;
    
run;

Use proc sql

I have the same data called ‘sales2’ with ‘proc sql’
proc sql ;
CREATE TABLE sales2 AS
    select week, 
        CASE
            WHEN sales_method = 'em + call' THEN 'Email + Call'
            WHEN sales_method = 'email' THEN 'Email'
            ELSE sales_method
        END AS sales_method,
        customer_id,
        nb_sold,
        CASE    
            WHEN years_as_customer > 40 THEN 5
            ELSE years_as_customer
        END AS years_as_customer,
        CASE 
            WHEN revenue IS NULL THEN 93.93
            ELSE revenue
        END AS revenue,
        nb_site_visits,
        state
    from sales
; 
quit;

Graphics

Number of Customers By Sale Methods


title "Number of Customers By Sale Methods";
proc sgplot data=sales2;
    vbar sales_method / datalabel ;
    xaxis ;
    yaxis grid ;
run;
title;

BarPlot

Total revenue By Sale Methods

proc sort data=sales2;
    by sales_method;
run;

data sales_re;
    set sales2(keep=sales_method revenue);
    by sales_method;
    if first.sales_method then sum_revenue = 0;
    sum_revenue + revenue;
    if last.sales_method;
run;

1 Call 244564.82

2 Email 723415.75

3 Email + 441038

BarPlot