The Dataset

The dataset contains information about various models of mobile phones from different brands.

Below are the available variables:

Product Name
Brand Name
Price
Rating
Reviews (text)
Review Votes

Data manipulation

1. Import the data frame into R and, using dplyr, filter it to keep prices >= 22 and brands among the following: (“NOKIA”,“APPLE”,“SAMSUNG”,“LENOVO”,“HUAWEI”,“ACER”,“ALCATEL”,“AMAZON”,“ASUS”,“BLACKBERRY”,“SONY”,“GOOGLE”,“HTC”,“LG”,“MOTOROLA”). Before filtering, use data cleaning tools to make the Brand Name variable match the brand names listed above.
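A minimal sketch of the cleaning and filtering step, run on a toy data frame instead of the real file. The column names (Brand.Name, Price, Rating) are assumptions based on how read.csv() would rename the variables listed above, and upper-casing plus trimming whitespace is only one possible way to clean the brand names.

```r
library(dplyr)

# Toy stand-in for the imported data; column names are assumed.
phones <- data.frame(
  Brand.Name = c("Nokia", " apple ", "Samsung", "BLU", "samsung"),
  Price      = c(50, 300, 10, 120, 250),
  Rating     = c(4, 5, 3, 2, 4),
  stringsAsFactors = FALSE
)

brands <- c("NOKIA", "APPLE", "SAMSUNG", "LENOVO", "HUAWEI", "ACER",
            "ALCATEL", "AMAZON", "ASUS", "BLACKBERRY", "SONY", "GOOGLE",
            "HTC", "LG", "MOTOROLA")

phones_clean <- phones %>%
  # Normalise case and surrounding whitespace so values match the list
  mutate(Brand.Name = toupper(trimws(Brand.Name))) %>%
  # Keep rows with price >= 22 and a brand from the list
  filter(Price >= 22, Brand.Name %in% brands)

print(phones_clean)
```

In this toy data the cheap Samsung row and the BLU row are dropped, and the remaining brand names are stored in upper case.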

2. Using dplyr, count the number of cellphones for each brand and sort the result in descending order.
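One way to do this is with dplyr's count() and its sort argument. The sketch below uses a toy data frame with an assumed Brand.Name column, and it counts rows, so each row is treated as one cellphone.

```r
library(dplyr)

# Toy data; Brand.Name is an assumed column name.
phones <- data.frame(
  Brand.Name = c("APPLE", "SAMSUNG", "APPLE", "NOKIA", "SAMSUNG", "APPLE"),
  stringsAsFactors = FALSE
)

# count() adds a column n; sort = TRUE orders it from largest to smallest
brand_counts <- phones %>%
  count(Brand.Name, sort = TRUE)

print(brand_counts)
```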

Exploratory Data Visualization

3. Create two bar plots showing 1) the average rating and 2) the average price for each brand. Arrange both plots in ascending order.

Google and Amazon have the highest average ratings, while Acer has the lowest.

Lenovo has the highest average price and Acer has the lowest.
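The two plots could be built roughly as follows. This is a sketch on made-up numbers (they are not the results reported above), with assumed column names; reorder() is used to sort the bars in ascending order of the plotted mean.

```r
library(dplyr)
library(ggplot2)

# Made-up toy data; column names are assumed.
phones <- data.frame(
  Brand.Name = c("APPLE", "APPLE", "ACER", "ACER", "GOOGLE", "GOOGLE"),
  Rating     = c(4, 5, 2, 3, 5, 5),
  Price      = c(300, 500, 60, 80, 400, 450),
  stringsAsFactors = FALSE
)

# One row per brand with the two averages
brand_means <- phones %>%
  group_by(Brand.Name) %>%
  summarise(avg_rating = mean(Rating), avg_price = mean(Price))

# reorder() sorts the brands by the value being plotted (ascending)
p_rating <- ggplot(brand_means,
                   aes(x = reorder(Brand.Name, avg_rating), y = avg_rating)) +
  geom_col() +
  labs(x = "Brand", y = "Average rating")

p_price <- ggplot(brand_means,
                  aes(x = reorder(Brand.Name, avg_price), y = avg_price)) +
  geom_col() +
  labs(x = "Brand", y = "Average price")

print(p_rating)
print(p_price)
```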

4. Using dplyr and ggplot2’s faceting, plot the distribution of ratings (with bars) for brands Apple, Samsung, HTC and Lenovo.

Samsung ranks first in ratings and Apple second, while Lenovo and HTC rank lowest.
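A possible shape for the faceted plot, again on toy data with assumed column names: filter to the four brands, then draw one bar chart of rating counts per brand with facet_wrap().

```r
library(dplyr)
library(ggplot2)

# Toy data; column names are assumed.
phones <- data.frame(
  Brand.Name = c("APPLE", "APPLE", "SAMSUNG", "SAMSUNG", "HTC", "LENOVO", "NOKIA"),
  Rating     = c(5, 4, 5, 5, 3, 2, 4),
  stringsAsFactors = FALSE
)

selected <- phones %>%
  filter(Brand.Name %in% c("APPLE", "SAMSUNG", "HTC", "LENOVO"))

# geom_bar() counts the rows at each rating; facet_wrap() gives one panel per brand
p_facets <- ggplot(selected, aes(x = Rating)) +
  geom_bar() +
  facet_wrap(~ Brand.Name) +
  labs(x = "Rating", y = "Number of reviews")

print(p_facets)
```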

5. Show the distribution of review length (number of characters). How long are the reviews that users write about mobile phones? (Zoom in with xlim() and change binwidth to see the plot clearly.)

Reviews of about 15 characters are the most common.

Reviews of about 60 characters or more are the least common.
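The review lengths can be computed with nchar() and plotted as a histogram. The sketch below uses three toy reviews and an assumed Reviews column; the binwidth and xlim() values are arbitrary and would need tuning on the real data.

```r
library(ggplot2)

# Toy data; Reviews is an assumed column name.
phones <- data.frame(
  Reviews = c("Great phone", "Bad", "Works fine so far"),
  stringsAsFactors = FALSE
)

# Number of characters in each review
phones$review_length <- nchar(phones$Reviews)

p_length <- ggplot(phones, aes(x = review_length)) +
  geom_histogram(binwidth = 5) +
  xlim(0, 100) +   # zoom in on short reviews
  labs(x = "Review length (characters)", y = "Number of reviews")

print(p_length)
```

Note that xlim() drops observations outside the range (with a warning) before the bins are computed, whereas coord_cartesian(xlim = ...) zooms without removing data.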

Text Mining

6. Using the tm package, create a Term Document Matrix for the reviews of 2 popular brands (Apple and Samsung). Because the data is huge, subset it to include only the first 10,000 reviews for each brand. Make sure to remove numbers, punctuation and stopwords, and to stem the reviews. Construct 2 bar plots showing the top 10 words people used in their reviews of these phones.

For Apple, the most frequent word is “phone” and the least frequent of the top 10 is “like”.

For Samsung, the most frequent word is “phone” and the least frequent of the top 10 is “get”. However, “phone” is used less often in the Apple plot than in the Samsung plot, and “love” is more frequent in the Samsung plot.
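The preprocessing and counting for one brand could be sketched as below, here on three toy reviews rather than the 10,000 real ones. The order of the cleaning steps is one reasonable choice, not necessarily the one used for the plots above; stemDocument needs the SnowballC package to be installed.

```r
library(tm)

# Toy reviews standing in for one brand's reviews
reviews <- c("I love this phone, bought 2 of them!",
             "The phone battery is great",
             "Great screen, great camera")

corpus <- VCorpus(VectorSource(reviews))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)      # uses SnowballC
corpus <- tm_map(corpus, stripWhitespace)

# Terms in rows, reviews in columns
tdm <- TermDocumentMatrix(corpus)

# Total frequency of each stemmed term, largest first
freq  <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
top10 <- head(freq, 10)

barplot(top10, las = 2, ylab = "Frequency")
```

Running the same steps once for the Apple subset and once for the Samsung subset would give the two bar plots.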

7. Construct a word cloud of Apple’s reviews.
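A minimal sketch with the wordcloud package. The frequencies here are made up for illustration; in practice they would be the row sums of the Apple Term Document Matrix.

```r
library(wordcloud)

# Made-up term frequencies, for illustration only
freq <- c(phone = 50, great = 30, work = 20, love = 15, screen = 10)

set.seed(1)  # the layout is random, so fix the seed for reproducibility
wordcloud(words = names(freq), freq = freq,
          min.freq = 1, max.words = 100,
          random.order = FALSE)  # most frequent words in the centre
```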

8. Repeat Q7 using TF-IDF weighting. Compare the two methods and comment on how they differ.

With TF-IDF weighting, the word “good” has a higher weight than the word “phone”.
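The sketch below, on three toy reviews, shows how TF-IDF weighting can be requested in tm and why a word that appears in almost every review is pushed down: its inverse document frequency is close to zero. In this toy case “phone” occurs in every document, so its weight is exactly zero.

```r
library(tm)

# Toy reviews: "phone" occurs in all of them, "good" in only one
reviews <- c("phone good", "phone bad", "phone fine")
corpus  <- VCorpus(VectorSource(reviews))

# Same matrix as before, but weighted by TF-IDF instead of raw counts
tdm_tfidf <- TermDocumentMatrix(corpus,
                                control = list(weighting = weightTfIdf))

# Total TF-IDF weight of each term across the documents
tfidf_weights <- sort(rowSums(as.matrix(tdm_tfidf)), decreasing = TRUE)
print(tfidf_weights)
```

The resulting weights can be passed to wordcloud() in place of the raw frequencies.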

9. Create a comparison-cloud for Apple’s reviews vs. Samsung and identify the words that are dissimilar.
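One way to prepare the input for comparison.cloud() is to collapse each brand's reviews into a single document, so that the term matrix has one column per brand. The reviews below are toy text, already lower-cased and without punctuation; on the real data the cleaning steps from Q6 would be applied first.

```r
library(tm)
library(wordcloud)

# Toy reviews for each brand
apple_reviews   <- c("love the iphone camera", "iphone screen cracked")
samsung_reviews <- c("galaxy battery lasts long", "love the galaxy display")

# One document per brand
docs <- c(APPLE   = paste(apple_reviews, collapse = " "),
          SAMSUNG = paste(samsung_reviews, collapse = " "))

corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

term_matrix <- as.matrix(TermDocumentMatrix(corpus))
colnames(term_matrix) <- names(docs)   # column names label the two halves

set.seed(1)
comparison.cloud(term_matrix, max.words = 50, random.order = FALSE)
```

Words shared equally by both brands are played down, while words used much more by one brand appear on that brand's side.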

10. Perform sentiment analysis on the cellphone reviews and find the average polarity of each brand (not only Samsung and Apple, but all the brands included in Q1). Run it on a subset of 1000 reviews and show the brands with the highest polarity score. Make sure to convert non-ASCII characters to ASCII before the analysis.
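The preparation for this step could look like the sketch below: draw the subset and strip non-ASCII characters with iconv(). The toy data and column names are assumptions, and the question does not name a sentiment package, so the polarity call is shown only as a commented suggestion using qdap's polarity().

```r
# Toy data; column names are assumed.
phones <- data.frame(
  Brand.Name = c("APPLE", "SAMSUNG", "NOKIA"),
  Reviews    = c("Great caf\u00e9 camera", "Love it \u2665", "Bad battery"),
  stringsAsFactors = FALSE
)

# Random subset of at most 1000 reviews
set.seed(1)
n_sample <- min(1000, nrow(phones))
subset_reviews <- phones[sample(nrow(phones), n_sample), ]

# Drop every character that cannot be represented in ASCII
subset_reviews$Reviews <- iconv(subset_reviews$Reviews,
                                from = "UTF-8", to = "ASCII", sub = "")

print(subset_reviews)

# Possible next step (assumption, not run here): average polarity per brand
# pol <- qdap::polarity(subset_reviews$Reviews,
#                       grouping.var = subset_reviews$Brand.Name)
```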

Arman Manjian

arman_manjian@edu.aua.am