R Bridge Course Final Project

The goal of this project is to analyze online sales in five countries (Australia, Belgium, France, Germany, and the United Kingdom) to inform expansion and investment decisions.

The sample used for this project is 3% of the 2010 online sales in those countries, randomly selected from the full dataset.

It is important in this project to understand the features of the data and to identify the best location for expansion and future investment.

I will compare sales across countries by adding a column that holds the total amount of each sale.

1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

Load the libraries

library(ggplot2)

First we load the data and show the summary statistics, including the means and medians.

url <- "https://raw.githubusercontent.com/akarimhammoud/RbridgeFinalProjectnlineRetail/master/Online%20Retail.csv"
OnlineSales <- read.csv(file= url, header=TRUE, sep=",")
summary(OnlineSales)
##   InvoiceNo          StockCode         Description           Quantity     
##  Length:513         Length:513         Length:513         Min.   : -7.00  
##  Class :character   Class :character   Class :character   1st Qu.:  4.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :  8.00  
##                                                           Mean   : 16.78  
##                                                           3rd Qu.: 12.00  
##                                                           Max.   :432.00  
##  InvoiceDate          UnitPrice        CustomerID      Country         
##  Length:513         Min.   : 0.000   Min.   :12395   Length:513        
##  Class :character   1st Qu.: 1.250   1st Qu.:12567   Class :character  
##  Mode  :character   Median : 2.100   Median :12686   Mode  :character  
##                     Mean   : 3.665   Mean   :14015                     
##                     3rd Qu.: 4.250   3rd Qu.:15311                     
##                     Max.   :42.950   Max.   :18074
head(OnlineSales)
##   InvoiceNo StockCode                         Description Quantity  InvoiceDate
## 1    536365    85123A  WHITE HANGING HEART T-LIGHT HOLDER        6 12/1/10 8:26
## 2    536365     71053                 WHITE METAL LANTERN        6 12/1/10 8:26
## 3    536365    84406B      CREAM CUPID HEARTS COAT HANGER        8 12/1/10 8:26
## 4    536365    84029G KNITTED UNION FLAG HOT WATER BOTTLE        6 12/1/10 8:26
## 5    536365    84029E      RED WOOLLY HOTTIE WHITE HEART.        6 12/1/10 8:26
## 6    536365     22752        SET 7 BABUSHKA NESTING BOXES        2 12/1/10 8:26
##   UnitPrice CustomerID        Country
## 1      2.55      17850 United Kingdom
## 2      3.39      17850 United Kingdom
## 3      2.75      17850 United Kingdom
## 4      3.39      17850 United Kingdom
## 5      3.39      17850 United Kingdom
## 6      7.65      17850 United Kingdom
str(OnlineSales)
## 'data.frame':    513 obs. of  8 variables:
##  $ InvoiceNo  : chr  "536365" "536365" "536365" "536365" ...
##  $ StockCode  : chr  "85123A" "71053" "84406B" "84029G" ...
##  $ Description: chr  "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
##  $ Quantity   : int  6 6 8 6 6 2 6 6 6 32 ...
##  $ InvoiceDate: chr  "12/1/10 8:26" "12/1/10 8:26" "12/1/10 8:26" "12/1/10 8:26" ...
##  $ UnitPrice  : num  2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
##  $ CustomerID : int  17850 17850 17850 17850 17850 17850 17850 17850 17850 13047 ...
##  $ Country    : chr  "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
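
The summary above shows a minimum Quantity of -7 and a minimum UnitPrice of 0, so it may help to count such rows and to see how many invoice lines each country contributes. The following is only a sketch: `toy` is a hypothetical stand-in with made-up values that reuses the column names of `OnlineSales`, and the source does not say what negative quantities represent.

```r
# Hypothetical stand-in for OnlineSales (made-up values, same column names)
toy <- data.frame(
  Quantity  = c(6, 8, -7, 2, 32),
  UnitPrice = c(2.55, 2.75, 1.85, 7.65, 0),
  Country   = c("United Kingdom", "United Kingdom", "France", "Germany", "France"),
  stringsAsFactors = FALSE
)

# Number of invoice lines per country
country_counts <- table(toy$Country)
print(country_counts)

# Rows that may need a closer look before totals are computed
n_negative_qty <- sum(toy$Quantity < 0)    # negative quantities
n_zero_price   <- sum(toy$UnitPrice == 0)  # items with a unit price of zero
cat("Negative quantities:", n_negative_qty, " Zero prices:", n_zero_price, "\n")
```

Running the same three calls on `OnlineSales` instead of `toy` would give the counts for the real sample.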

2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

Create a new data frame called “mySets” with only five columns, renaming the “Description” column to “Details” and the “InvoiceDate” column to “Date”.

mySets <- OnlineSales[ c("Description", "Quantity", "InvoiceDate", "UnitPrice", "Country")]
colnames(mySets) <- c("Details", "Quantity", "Date", "UnitPrice", "Country")
head(mySets)
##                               Details Quantity         Date UnitPrice
## 1  WHITE HANGING HEART T-LIGHT HOLDER        6 12/1/10 8:26      2.55
## 2                 WHITE METAL LANTERN        6 12/1/10 8:26      3.39
## 3      CREAM CUPID HEARTS COAT HANGER        8 12/1/10 8:26      2.75
## 4 KNITTED UNION FLAG HOT WATER BOTTLE        6 12/1/10 8:26      3.39
## 5      RED WOOLLY HOTTIE WHITE HEART.        6 12/1/10 8:26      3.39
## 6        SET 7 BABUSHKA NESTING BOXES        2 12/1/10 8:26      7.65
##          Country
## 1 United Kingdom
## 2 United Kingdom
## 3 United Kingdom
## 4 United Kingdom
## 5 United Kingdom
## 6 United Kingdom

Replace ‘United Kingdom’ with ‘UK’ in the Country column.

# Match the literal string; the original pattern began with a stray "*"
mySets$Country <- sub("United Kingdom", "UK", mySets$Country, fixed = TRUE)
head(mySets)
##                               Details Quantity         Date UnitPrice Country
## 1  WHITE HANGING HEART T-LIGHT HOLDER        6 12/1/10 8:26      2.55      UK
## 2                 WHITE METAL LANTERN        6 12/1/10 8:26      3.39      UK
## 3      CREAM CUPID HEARTS COAT HANGER        8 12/1/10 8:26      2.75      UK
## 4 KNITTED UNION FLAG HOT WATER BOTTLE        6 12/1/10 8:26      3.39      UK
## 5      RED WOOLLY HOTTIE WHITE HEART.        6 12/1/10 8:26      3.39      UK
## 6        SET 7 BABUSHKA NESTING BOXES        2 12/1/10 8:26      7.65      UK

Here I add a new column called “Amount” that holds the total amount of each sale, calculated by multiplying the UnitPrice column by the Quantity column.

mySets["Amount"] <- mySets$Quantity * mySets$UnitPrice
head(mySets)
##                               Details Quantity         Date UnitPrice Country
## 1  WHITE HANGING HEART T-LIGHT HOLDER        6 12/1/10 8:26      2.55      UK
## 2                 WHITE METAL LANTERN        6 12/1/10 8:26      3.39      UK
## 3      CREAM CUPID HEARTS COAT HANGER        8 12/1/10 8:26      2.75      UK
## 4 KNITTED UNION FLAG HOT WATER BOTTLE        6 12/1/10 8:26      3.39      UK
## 5      RED WOOLLY HOTTIE WHITE HEART.        6 12/1/10 8:26      3.39      UK
## 6        SET 7 BABUSHKA NESTING BOXES        2 12/1/10 8:26      7.65      UK
##   Amount
## 1  15.30
## 2  20.34
## 3  22.00
## 4  20.34
## 5  20.34
## 6  15.30
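
With a per-row Amount available, the totals per country can be compared directly. This is a minimal sketch using base R's `aggregate()`: `toy` is a hypothetical stand-in for `mySets` with made-up rows, so the printed totals are illustrative only and are not results from the real sample.

```r
# Hypothetical stand-in for mySets (made-up rows, same column names)
toy <- data.frame(
  Quantity  = c(6, 6, 8, 4, 10),
  UnitPrice = c(2.55, 3.39, 2.75, 1.25, 2.10),
  Country   = c("UK", "UK", "UK", "France", "Germany"),
  stringsAsFactors = FALSE
)
toy$Amount <- toy$Quantity * toy$UnitPrice

# Sum Amount within each country, then sort from largest to smallest total
totals <- aggregate(Amount ~ Country, data = toy, FUN = sum)
totals <- totals[order(-totals$Amount), ]
print(totals)
```

Replacing `toy` with `mySets` in the `aggregate()` call would rank the five countries in the study by total sales amount.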

We project that total sales will increase by at least 10% next year, so we create a new column, AmountNextYear, equal to “Amount” multiplied by 1.10.

mySets$AmountNextYear <- mySets$Amount * 1.10
head(mySets)
##                               Details Quantity         Date UnitPrice Country
## 1  WHITE HANGING HEART T-LIGHT HOLDER        6 12/1/10 8:26      2.55      UK
## 2                 WHITE METAL LANTERN        6 12/1/10 8:26      3.39      UK
## 3      CREAM CUPID HEARTS COAT HANGER        8 12/1/10 8:26      2.75      UK
## 4 KNITTED UNION FLAG HOT WATER BOTTLE        6 12/1/10 8:26      3.39      UK
## 5      RED WOOLLY HOTTIE WHITE HEART.        6 12/1/10 8:26      3.39      UK
## 6        SET 7 BABUSHKA NESTING BOXES        2 12/1/10 8:26      7.65      UK
##   Amount AmountNextYear
## 1  15.30         16.830
## 2  20.34         22.374
## 3  22.00         24.200
## 4  20.34         22.374
## 5  20.34         22.374
## 6  15.30         16.830

3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

Boxplot of the spread of amounts spent on online purchases per country. We notice that people in the UK are willing to spend relatively higher amounts per purchase.

# Fill by Country: a continuous fill (Amount) cannot be mapped to a whole box
ggplot(mySets, aes(x = Country, y = Amount, fill = Country)) + geom_boxplot() +
  ggtitle("Boxplot of Amount spreads of online purchases per Country") + theme_classic() + xlab("Country")
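
A boxplot comparison can be backed up with the per-country medians and interquartile ranges that the boxes are drawn from. The sketch below uses base R's `tapply()`; `toy` is a hypothetical stand-in for `mySets` with made-up amounts, so the numbers say nothing about the real data.

```r
# Hypothetical stand-in for mySets (made-up amounts)
toy <- data.frame(
  Amount  = c(15.30, 20.34, 22.00, 5.00, 9.00, 21.00, 12.00),
  Country = c("UK", "UK", "UK", "France", "France", "Germany", "Germany"),
  stringsAsFactors = FALSE
)

# Median and interquartile range of Amount within each country
medians <- tapply(toy$Amount, toy$Country, median)
iqrs    <- tapply(toy$Amount, toy$Country, IQR)
print(data.frame(median = medians, IQR = iqrs))
```

Applied to `mySets`, the same two calls would show whether the UK median is in fact higher than the others.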

Boxplot of Unit Price spreads of online purchases per Country.

# Fill by Country: a continuous fill (UnitPrice) cannot be mapped to a whole box
ggplot(mySets, aes(x = Country, y = UnitPrice, fill = Country)) + geom_boxplot() +
  ggtitle("Boxplot of Unit Price spreads of online purchases per Country") + theme_classic() + xlab("Country")

Using a histogram, we check the frequency of the unit prices of the items sold in those countries. We notice that the majority of the items have a unit price of less than $10.

hist(mySets$UnitPrice, breaks= 10, xlim = c(0, 50), ylim = c(0, 500), xlab = "UnitPrice", main = "R Histogram \nUnit Price", col = "red")
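
The observation that most unit prices fall under $10 can also be checked numerically instead of being read off the histogram. This is a small sketch: `unit_price` is a hypothetical vector of made-up prices standing in for `mySets$UnitPrice`.

```r
# Hypothetical unit prices standing in for mySets$UnitPrice
unit_price <- c(2.55, 3.39, 2.75, 7.65, 4.25, 1.85, 12.75, 42.95)

# The mean of a logical vector is the proportion of TRUE values
share_under_10 <- mean(unit_price < 10)
cat("Share of items priced under $10:", share_under_10, "\n")
```

With the real column, `mean(mySets$UnitPrice < 10)` would give the share of sampled items priced below $10.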

Next, we rescale the histogram to show density rather than counts.

hist(mySets$UnitPrice, freq = FALSE, main = "Density Plot of the Unit Prices in the Study")

Now we overlay a normal distribution curve on the unit prices and add axis labels and color.

hist(mySets$UnitPrice, freq = FALSE, xlab = "Unit Price", main = "Density Plot of the Unit Prices in this Study", col="lightblue")
curve(dnorm(x, mean=mean(mySets$UnitPrice), sd=sd(mySets$UnitPrice)), add=TRUE, col="darkred", lwd=2)

Now I want to check how much people are willing to spend online, using a histogram built with ggplot2.

A <- ggplot(mySets, aes(x=Amount))
B <- A + geom_histogram(binwidth = 1, color='red',fill='pink', alpha = 0.4)
C <- B + xlab('Amount of sales') + ylab('Count')
print(C + ggtitle("Count of the Total Amount of sales"))

Now I want to check the density of the Amounts spent on the internet in 2010.

ggplot(data = mySets) + geom_density(aes(x = Amount), fill = "grey50")

Now I want to check the amounts spent in each of the five countries, using ggplot with geom_line.

ggplot(mySets, aes(x = Country, y = Amount)) + geom_line()

Scatter plot using Country and Amount variables

ggplot(mySets, aes(x = Country, y = Amount))+ geom_point()

Scatter plot using Country, Amount, and Unit Price variables

graph <- ggplot(mySets, aes(x= Country, y = Amount)) + geom_line(color = "red") + geom_point()
graph <- graph + geom_line(aes(x = Country, y = UnitPrice), color = "green")
graph

Scatter plot of the amounts spent in each of the five countries, using geom_point and geom_smooth with the axes swapped.

ggplot(mySets, aes(x = Amount, y = Country)) + geom_point(na.rm=TRUE)+geom_smooth(method=lm,se=FALSE, na.rm=TRUE)
## `geom_smooth()` using formula 'y ~ x'

4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

Conclusion:

After analyzing online sales in five countries (Australia, Belgium, France, Germany, and the United Kingdom), it is clear that in 2010 people in the UK bought more online than people in any other country in the study. Most items sold online have a unit price below $10, which suggests that cheaper items are more likely to sell online, although unit prices in Australia are slightly higher than in the other countries. Finally, I advise expanding in the UK, the country with the highest number of online sales, by offering more low-priced products. Investing in the other countries would require more programs that encourage buyers to purchase products online.