Project 2

Dataset 1: Laptop Dataset

This dataset comes from Kaggle: https://www.kaggle.com/datasets/ehtishamsadiq/uncleaned-laptop-price-dataset/data

First we import it from GitHub and remove the blank rows.

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(ggplot2)

Laptops = read.csv('https://raw.githubusercontent.com/bwolin99/TestRepo/refs/heads/main/Project2/laptopData.csv')
Laptops = Laptops %>% na.omit()

Next we reduce the Cpu column to just the clock speed, in gigahertz (GHz), of each CPU.

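## Keep the five characters before "GHz", then trim to the part that starts at the space
Laptops$Cpu = str_extract(Laptops$Cpu, "(.)(.)(.)(.)(.)(GHz)" )
Laptops$Cpu = str_extract(Laptops$Cpu, "(?= )(.*)(GHz)" )
## Drop any "V6" model suffix caught in the match so its digit is not read as part of the speed
Laptops$Cpu = gsub("V6","",Laptops$Cpu)
## Strip everything except digits and the decimal point, then convert to numeric
Laptops$Cpu = as.numeric(gsub("[^0-9.]", "", Laptops$Cpu))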
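
The same extraction can be sketched with a single lookahead pattern in base R. This is a minimal illustration, not the code used above: the helper name `extract_ghz` and the sample CPU strings are assumptions made for the example.

```r
# Hypothetical helper: pull the number that directly precedes "GHz".
extract_ghz <- function(cpu) {
  # Locate the first run of digits/dots that is immediately followed by "GHz"
  m <- regexpr("[0-9.]+(?=GHz)", cpu, perl = TRUE)
  out <- rep(NA_character_, length(cpu))
  hit <- !is.na(m) & m != -1
  # regmatches() returns only the successful matches, in order
  out[hit] <- regmatches(cpu, m)
  as.numeric(out)
}

# Illustrative inputs (assumed to resemble values in the Cpu column)
sample_cpu <- c("Intel Core i5 2.3GHz", "Intel Xeon E3-1505M V6 3GHz", "unknown")
extract_ghz(sample_cpu)
```

Strings without a GHz figure come back as NA, which makes unparsed rows easy to spot before plotting.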

Now we can use boxplots to compare clock speeds across the different types of laptops.

ggplot(Laptops, aes(x=TypeName, y=Cpu)) + 
  geom_boxplot()

From this we can see that gaming laptops and workstations have the highest clock speeds on average, while netbooks fall far behind the rest. If you want a laptop with a faster processor, a gaming or workstation model is the way to go.
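
To put numbers behind a boxplot comparison like this, the median clock speed per laptop type can be tabulated. The sketch below uses a small made-up data frame (the values are assumptions, not taken from the dataset) with the same `TypeName` and `Cpu` column names.

```r
# Made-up sample rows; only the column names mirror the real data
laptops <- data.frame(
  TypeName = c("Gaming", "Gaming", "Netbook", "Netbook", "Notebook"),
  Cpu = c(2.8, 2.6, 1.1, 1.6, 2.5),
  stringsAsFactors = FALSE
)

# Median clock speed for each laptop type, highest first
medians <- aggregate(Cpu ~ TypeName, data = laptops, FUN = median)
medians <- medians[order(-medians$Cpu), ]
medians
```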

Dataset 2: Development Indicators

Dataset Link: https://databank.worldbank.org/source/world-development-indicators/preview/on#

This dataset contains many development indicators for each country, specifically looking at the G20.

Indications = read.csv('https://raw.githubusercontent.com/bwolin99/TestRepo/refs/heads/main/Project2/DevelopmentIndicators.csv')
head(Indications)

##   Country.Name Country.Code
## 1    Argentina          ARG
## 2    Argentina          ARG
## 3    Argentina          ARG
## 4    Argentina          ARG
## 5    Argentina          ARG
## 6    Argentina          ARG
##                                                            Series.Name
## 1        Adolescent fertility rate (births per 1,000 women ages 15-19)
## 2           Agriculture, forestry, and fishing, value added (% of GDP)
## 3       Annual freshwater withdrawals, total (% of internal resources)
## 4                 Births attended by skilled health staff (% of total)
## 5                               CO2 emissions (metric tons per capita)
## 6 Contraceptive prevalence, any method (% of married women ages 15-49)
##      Series.Code X2014..YR2014. X2015..YR2015. X2016..YR2016. X2017..YR2017.
## 1    SP.ADO.TFRT         67.791         65.395         61.852         57.783
## 2 NV.AGR.TOTL.ZS    6.712703514    5.156685902     6.26456582    5.231622377
## 3 ER.H2O.FWTL.ZS    12.90753425    12.90753425    12.90753425    12.90753425
## 4 SH.STA.BRTC.ZS           99.6           99.6           98.4           93.9
## 5 EN.ATM.CO2E.PC    4.209111895    4.301913806    4.201815869    4.070111687
## 6 SP.DYN.CONU.ZS             ..             ..             ..             ..
##   X2018..YR2018. X2019..YR2019. X2020..YR2020. X2021..YR2021. X2022..YR2022.
## 1         51.029         46.153         39.866         39.065         37.932
## 2    4.537878897    5.318555997    6.357033676    7.306308855    6.639898278
## 3    12.90753425    12.90753425    12.90753425             ..             ..
## 4           99.5           99.6           98.8             ..             ..
## 5    3.975650744    3.742029812     3.40561754             ..             ..
## 6             ..             ..           70.1             ..             ..
##   X2023..YR2023.
## 1             ..
## 2    6.059508763
## 3             ..
## 4             ..
## 5             ..
## 6             ..
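
The `..` entries in this output appear to mark missing values. One way to handle them at import time is the `na.strings` argument of `read.csv`, sketched here on a toy CSV string; the text and its column names are made up for illustration.

```r
# Toy CSV text standing in for the real file; column names are hypothetical
csv_text <- "Country,Series,Y2023
A,GDP,1000
A,CO2,..
B,GDP,2500
B,CO2,3.4"

# na.strings tells read.csv to treat ".." as missing rather than as text,
# so the year column is read as numeric instead of character
indicators <- read.csv(text = csv_text, na.strings = "..", stringsAsFactors = FALSE)
str(indicators$Y2023)
```

With this approach the later `as.numeric()` conversions would no longer be needed for the year columns.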

We will transform this data to compare countries in a specific year, 2023. To do this we build a data frame showing each country's GDP, GDP growth, and population. We then add a column for GDP per capita.

## First select only the necessary columns
Ind_Clean = data.frame(Indications$Country.Name, Indications$Series.Name, Indications$X2023..YR2023.)
names(Ind_Clean) = c('Country','Questions','Amount')

## Then we spread the data so each indicator becomes its own column. Next we select the necessary columns
Ind_Clean = spread(Ind_Clean,Questions,Amount)
Ind_Final = data.frame(Ind_Clean$Country,Ind_Clean$`GDP (current US$)`,Ind_Clean$`GDP growth (annual %)`,Ind_Clean$`Population, total`)
names(Ind_Final) = c('Country','GDP','GDP_Growth','Population')

## Finally we create our calculated column and rank the countries based on that
Ind_Final$GDP = as.numeric(Ind_Final$GDP)
Ind_Final$Population = as.numeric(Ind_Final$Population)
Ind_Final = mutate(Ind_Final, GDP_per_cap = GDP/Population)
Ind_Final = Ind_Final %>% arrange(desc(GDP_per_cap))
Ind_Final

##               Country         GDP   GDP_Growth Population GDP_per_cap
## 1         Switzerland 8.84940e+11  0.716066869    8849852   99994.893
## 2       United States 2.73609e+13  2.542700299  334914895   81695.083
## 3           Australia 1.72383e+12  3.016988104   26638544   64711.870
## 4         Netherlands 1.11812e+12  0.116008635   17879488   62536.466
## 5             Germany 4.45608e+12 -0.304934576   84482267   52745.744
## 6      United Kingdom 3.34003e+12  0.104017849   68350000   48866.569
## 7              France 3.03090e+12  0.703718547   68170228   44460.758
## 8               Italy 2.25485e+12  0.920692214   58761146   38373.145
## 9               Japan 4.21295e+12  1.923055531  124516650   33834.431
## 10        Korea, Rep. 1.71279e+12  1.356733243   51712619   33121.316
## 11              Spain 1.58069e+12   2.50329436   48373336   32676.886
## 12       Saudi Arabia 1.06758e+12 -0.754914811   36947025   28894.884
## 13 Russian Federation 2.02142e+12          3.6  143826130   14054.609
## 14             Mexico 1.78889e+12  3.228736756  128455567   13926.138
## 15          Argentina 6.40591e+11 -1.550501536   46654581   13730.506
## 16            Turkiye 1.10802e+12  4.516866508   85326000   12985.725
## 17              China 1.77948e+13          5.2 1410710000   12614.074
## 18             Brazil 2.17367e+12  2.908480487  216422446   10043.644
## 19          Indonesia 1.37117e+12  5.048105771  277534122    4940.546
## 20              India 3.54992e+12  7.583971124 1428627663    2484.846

From this transformation we can see that the countries with the lowest GDP per capita are Brazil, Indonesia, and India. The countries with the highest GDP per capita are Switzerland, the US, and Australia.
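
The long-to-wide step and the calculated column can also be sketched in base R with `reshape()`. This is an alternative illustration rather than the method used above; it uses two countries with the rounded GDP and population figures printed in the table, and the variable names `long` and `wide` are assumptions.

```r
# Long format: one row per country and indicator (rounded values from the table above)
long <- data.frame(
  Country = c("India", "India", "Switzerland", "Switzerland"),
  Series = c("GDP", "Population", "GDP", "Population"),
  Amount = c(3.54992e12, 1428627663, 8.84940e11, 8849852),
  stringsAsFactors = FALSE
)

# Wide format: one row per country, one column per indicator
wide <- reshape(long, idvar = "Country", timevar = "Series", direction = "wide")
names(wide) <- sub("^Amount\\.", "", names(wide))

# Calculated column, then rank from highest to lowest
wide$GDP_per_cap <- wide$GDP / wide$Population
wide <- wide[order(-wide$GDP_per_cap), ]
wide
```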

Dataset 3: Movie Reviews

For this dataset we will take the 1/0 indicator columns that denote a movie's categories and condense them into one column.
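
As a small illustration of the idea, the sketch below condenses 1/0 genre columns into a single comma-separated column on a made-up data frame. It works row by row on the wide data, which is a different route from the gather-based code that follows; the movie names and the `genre_cols` variable are assumptions.

```r
# Made-up movies with 1/0 genre indicator columns
movies <- data.frame(
  Name = c("Movie A", "Movie B"),
  Action = c(1, 0),
  Comedy = c(1, 1),
  Drama = c(0, 0),
  stringsAsFactors = FALSE
)
genre_cols <- c("Action", "Comedy", "Drama")

# For each row, keep the names of the genre columns flagged with 1
movies$Genres <- vapply(seq_len(nrow(movies)), function(i) {
  flags <- unlist(movies[i, genre_cols])
  paste(genre_cols[flags == 1], collapse = ", ")
}, character(1))

movies[, c("Name", "Genres")]
```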

Movies = read.csv('https://raw.githubusercontent.com/hms-dbmi/UpSetR/refs/heads/master/inst/extdata/movies.csv',sep= ';')
## First we gather the genres to reduce the number of columns and turn their values into an Is_Category metric. Then we can filter out all of the rows that are not a selected genre of the movie
Movies_Clean = gather(Movies,'Categories','Is_Category',3:19)
Movies_Clean = Movies_Clean[Movies_Clean$Is_Category != 0,]

## Now we build a comma-separated genre string for each row's movie
Genre_List = list()
for (i in Movies_Clean$Name){
  genres = ''
  df = Movies_Clean[Movies_Clean$Name == i,]
  for (x in df$Categories){
    genres = paste(genres,x,sep = ", ")
  }
  Genre_List[[length(Genre_List)+1]] = genres
}
Movies_Clean$Genre =  Genre_List 
## Drop the leading ", " left over from the first paste
Movies_Clean$Genre = substring(Movies_Clean$Genre, 3)

## Next we create the final table and remove duplicates
Movies_Final = data.frame(Movies_Clean$Name, Movies_Clean$ReleaseDate, Movies_Clean$Genre,Movies_Clean$AvgRating,Movies_Clean$Watches)
names(Movies_Final) = c('Names','ReleaseYear','Genres','AvgRating','Watches')
Movies_Final = Movies_Final[!duplicated(Movies_Final$Names),]
Movies_Final = Movies_Final %>% arrange(desc(AvgRating))
head(Movies_Final,20)

##                                                                  Names
## 1                                              Ulysses (Ulisse) (1954)
## 2                                              Follow the Bitch (1998)
## 3                                                 Smashing Time (1967)
## 4                                             One Little Indian (1973)
## 5                                                         Lured (1947)
## 6                                   Gate of Heavenly Peace, The (1995)
## 7                                             Bittersweet Motel (2000)
## 8                            Schlafes Bruder (Brother of Sleep) (1995)
## 9                                               Song of Freedom (1936)
## 10                                                    Baby, The (1973)
## 11                                 I Am Cuba (Soy Cuba/Ya Kuba) (1964)
## 12                                                     Lamerica (1994)
## 13                                             Apple, The (Sib) (1998)
## 14                                                      Sanjuro (1962)
## 15 Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)
## 16                                    Shawshank Redemption, The (1994)
## 17                                               Godfather, The (1972)
## 18                                               Close Shave, A (1995)
## 19                                          Usual Suspects, The (1995)
## 20                                          Wrong Trousers, The (1993)
##    ReleaseYear                 Genres AvgRating Watches
## 1         1954              Adventure      5.00       1
## 2         1998                 Comedy      5.00       1
## 3         1967                 Comedy      5.00       2
## 4         1973 Comedy, Drama, Western      5.00       1
## 5         1947                  Crime      5.00       1
## 6         1995            Documentary      5.00       3
## 7         2000            Documentary      5.00       1
## 8         1995                  Drama      5.00       1
## 9         1936                  Drama      5.00       1
## 10        1973                 Horror      5.00       1
## 11        1964                  Drama      4.80       5
## 12        1994                  Drama      4.75       8
## 13        1998                  Drama      4.67       9
## 14        1962      Action, Adventure      4.61      69
## 15        1954          Action, Drama      4.56     628
## 16        1994                  Drama      4.55    2227
## 17        1972   Action, Crime, Drama      4.52    2223
## 18        1995       Comedy, Thriller      4.52     657
## 19        1995        Crime, Thriller      4.52    1783
## 20        1993                 Comedy      4.51     882

Looking at the top 20 movies, every one of the top 10 has at most 3 views. I'd say you can't get any valuable information from a movie's average rating unless it has at least 10 views. By that criterion, the highest rated movies are Sanjuro, Seven Samurai, The Shawshank Redemption, The Godfather, and A Close Shave.
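
The minimum-views rule can be applied as a filter before ranking. The sketch below uses four rows copied from the output above; the 10-view cutoff comes from the text, while the variable names `movies_final`, `min_watches`, and `well_watched` are assumptions.

```r
# Four rows taken from the printed table above
movies_final <- data.frame(
  Names = c("Ulysses (Ulisse) (1954)", "Lamerica (1994)",
            "Sanjuro (1962)", "Shawshank Redemption, The (1994)"),
  AvgRating = c(5.00, 4.75, 4.61, 4.55),
  Watches = c(1, 8, 69, 2227),
  stringsAsFactors = FALSE
)

# Keep only movies with enough views, then rank by average rating
min_watches <- 10
well_watched <- movies_final[movies_final$Watches >= min_watches, ]
well_watched <- well_watched[order(-well_watched$AvgRating), ]
well_watched
```

Raising or lowering `min_watches` shows how sensitive the ranking is to the cutoff.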

Project 2 - 607

Ben Wolin

2024-10-06
